The R Series
Learn R: As a Language
Pedro J. Aphalo
First edition published 2020 by CRC Press
Preface
“Suppose that you want to teach the ‘cat’ concept to a very young child.
Do you explain that a cat is a relatively small, primarily carnivorous
mammal with retractible claws, a distinctive sonic output, etc.? I’ll bet
not. You probably show the kid a lot of different cats, saying ‘kitty’
each time, until it gets the idea. To put it more generally, generaliza-
tions are best made by abstraction from experience.”
R. P. Boas
Can we make mathematics intelligible?, 1981
This book covers different aspects of the use of the R language. Chapters 1 to 5
describe the R language itself. Later chapters describe extensions to the R language
available through contributed packages, the grammar of data and the grammar of
graphics. In this book, explanations are concise but contain pointers to additional
sources of information, so as to encourage the development of a routine of inde-
pendent exploration. This is not an arbitrary decision; it is the normal modus
operandi of most of us who use R regularly for a variety of different problems.
Some have called approaches like the one used here “learning the hard way,” but
I would call it “learning to be independent.”
I do not discuss statistics or data analysis methods in this book; I describe R
as a language for data manipulation and display. The idea is for you to learn the
R language in a way comparable to how children learn a language: they work out
what the rules are, simply by listening to people speak and trying to utter what
they want to tell their parents. Of course, small children receive some guidance,
but they are not taught a prescriptive set of rules like when learning a second
language at school. Instead of listening, you will read code, and instead of speaking,
you will try to execute R code statements on a computer—i.e., you will try your
hand at using R to tell a computer what you want it to compute. I do provide
explanations and guidance, but the idea of this book is for you to use the numerous
examples to find out by yourself the overall patterns and coding philosophy behind
the R language. Instead of parents being the sound board for your first utterances
in R, the computer will play this role. You will play by modifying the examples,
see how the computer responds: does R understand you or not? Using a language
actively is the most efficient way of learning it. By using it, I mean actually reading,
writing, and running scripts or programs (copying and pasting, or typing ready-
made examples from books or the internet, does not qualify as using a language).
I have been using R since around 1998 or 1999, but I am still constantly learning
new things about R itself and R packages. With time, it has replaced in my work
as a researcher and teacher several other pieces of software: SPSS, Systat, Origin,
MS-Excel, and it has become a central piece of the tool set I use for producing
lecture slides, notes, books, and even web pages. This is to say that it is the most
useful piece of software and programming language I have ever learned to use. Of
course, in time it will be replaced by something better, but at the moment it is a
key language to learn for anybody with a need to analyze and display data.
What is a language? A language is a system of communication. R as a language
allows us to communicate with other members of the R community, and with com-
puters. As with all languages in active use, R evolves. New “words” and new “con-
structs” are incorporated into the language, and some earlier frequently used ones
are relegated to the fringes of the corpus. I describe current usage and idioms
of the R language in a way accessible to a readership unfamiliar with computer sci-
ence but with some background in data analysis as used in biology, engineering,
or the humanities.
When teaching, I tend to lean toward challenging students, rather than telling
an over-simplified story. There are two reasons for this. First, as a student I prefer,
and learn best, when the going is not too easy. Second, were I to hide the tricky
bits of the R language, it would make the reader’s life much more difficult later on.
You will not remember all the details; nobody could. However, you most likely
will remember or develop a sense of when you need to be careful or should check
the details. So, I will expose you not only to the usual cases, but also to several
exceptions and counterintuitive features of the language, which I have highlighted
with icons. Reading this book will be about exploring a new world; this book aims
to be a travel guide, but neither a traveler’s account, nor a cookbook of R recipes.
Keep in mind that it is impossible to remember everything about R! The R lan-
guage, in a broad sense, is vast because its capabilities can be expanded with in-
dependently developed packages. Learning to use R consists of learning the basics
plus developing the skill of finding your way in R and its documentation. In early
2020, the number of packages available in the Comprehensive R Archive Network
(CRAN) broke the 15,000 barrier. CRAN is the most important, but not only, public
repository for R packages. How good a command of the R language and packages
a user needs depends on the type of activities to be carried out. This book at-
tempts to train you in the use of the R language itself, and of popular R language
extensions for data manipulation and graphical display. Given the availability of
numerous books on statistical analysis with R, in the present book I will cover
only the bare minimum of this subject. The same is true for package development
in R. This book is somewhere in-between, aiming at teaching programming in the
small: the use of R to automate the drudgery of data manipulation, including the
different steps spanning from data input and exploration to the production of
publication-quality illustrations.
As with all “rich” languages, there are many different ways of doing things in
R. In almost all cases there is no one-size-fits-all solution to a problem. There is
always a compromise involved, usually between time spent by the user and pro-
cessing time required in the computer. Many of the packages that are most popular
nowadays did not exist when I started using R, and many of these packages make
new approaches available. One could write many different R books with a given
aim using substantially different ways of achieving the same results. In this book, I
limit myself to packages that are currently popular and/or that I consider elegantly
designed. I have in particular tried to limit myself to packages with similar design
philosophies, especially in relation to their interfaces. What is elegant design, and
in particular what is a friendly user interface, depends strongly on each user’s
preferences and previous experience. Consequently, the contents of the book are
strongly biased by my own preferences. I have tried to write examples in ways
that execute fast without compromising readability. I encourage readers to take
this book as a starting point for exploring the very many packages, styles, and
approaches which I have not described.
I will appreciate suggestions for further examples, and notification of errors
and unclear sections. Because the examples here have been collected from diverse
sources over many years, not all sources are acknowledged. If you recognize any
example as yours or someone else’s, please let me know so that I can add a proper
acknowledgement. I warmly thank the students who have asked the questions and
posed the problems that have helped me write this text and correct the mistakes
and gaps of previous versions. I have also received help on online forums and in
person from numerous people, learned from archived e-mail list messages, blog
posts, books, articles, tutorials, webinars, and by struggling to solve some new
problems on my own. In many ways this text owes much more to people who are
not authors than to myself. However, as I am the one who has written this version
and decided what to include and exclude, as author, I take full responsibility for
any errors and inaccuracies.
Why have I chosen the title “Learn R: As a Language”? This book is based on ex-
ploration and practice that aim at teaching how to express various generic operations
on data using the R language. It focuses on the language, rather than on specific
types of data analysis, and exposes the reader to current usage and does not spare
the quirks of the language. When we use our native language in everyday life, we
do not think about grammar rules or sentence structure, except for the trickier or
unfamiliar situations. My aim is for this book to help you grow to use R in this
same way, to become fluent in R. The book is structured around the elements of
languages with chapter titles that highlight the parallels between natural languages
like English and the R language.
I encourage you to approach R like a child approaches his or her mother tongue
when first learning to speak: do not struggle, just play, and fool around with R! If
the going gets difficult and frustrating, take a break! If you get a new insight, take
a break to enjoy the victory!
Acknowledgements
First I thank Jaakko Heinonen for introducing me to the then new R. Along the
way many well-known and not-so-famous experts have answered my questions
on Usenet and more recently on StackOverflow. As time went by, answering other
people’s questions, both on the internet and in person, became the driving force
for me to delve into the depths of the R language. Of course, I still get stuck from
time to time and ask for help. I wish to warmly thank all the people I have inter-
acted with in relation to R, including members of my own research group, students
participating in the courses I have taught, colleagues I have collaborated with, au-
thors of the books I have read and people I have only met online or at conferences.
All of them have made it possible for me to write this book. This has been a time-
consuming endeavour which has kept me too many hours away from my family,
so I especially thank Tarja, Rosa and Tomás for their understanding. I am indebted
to Tarja Lehto, Titta Kotilainen, Tautvydas Zalnierius, Fang Wang, Yan Yan, Neha
Rai, Markus Laurel, other colleagues, students and anonymous reviewers for many
very helpful comments on different versions of the book manuscript, Rob Calver,
as editor, for his encouragement and patience during the whole duration of this
book writing project, Lara Spieker, Vaishali Singh, and Paul Boyd for their help with
different aspects of this project.
U Signals advanced playground boxes, which will require more time to play
with before the concepts are grasped than regular playground boxes.
Signals in-depth explanations of specific points that may require you to spend
time thinking. These can in general be skipped on first reading, but you should
return to them at a later peaceful time, preferably with a cup of coffee or tea.
= Signals text boxes providing general information not directly related to the R
language.
1 R: The language and the program
1.2 R
1.2.1 What is R?
Most people think of R as a computer program. R is indeed a computer program—
a piece of software— but it is also a computer language, implemented in the R
program. Does this make a difference? Yes. Until recently we had only one main-
stream implementation of R, the program R. Recently another implementation has
gained some popularity, Microsoft R Open (MRO), which is directly based on the
R program from The R Project for Statistical Computing. MRO is described as an
enhanced distribution of R. These two very similar implementations are not the
only ones available, but others are not in widespread use. In other words, the R
language can be used not only in the R program, and it is feasible that other im-
plementations will be developed in the future.
The name “base R” is used to distinguish R itself, as in the R distribution, from
R in a broader sense, which includes independently developed extensions that can
be loaded from separately distributed extension packages.
Because R is essentially a command-line application, it can be used on what
nowadays are frugal computing resources, equivalent to a personal computer of
three decades ago. R can run even on the Raspberry Pi, a single-board computer
with the processing power of a modest smartphone. At the other end of the spec-
trum, on really powerful servers, R can be used for the analysis of big data sets
with millions of observations. How powerful a computer you will need will depend
on the size of the data sets you want to analyze, on how patient you are, and on
your ability to write “good” code.
One could think of R as a dialect of an earlier language, called S. S evolved
into S-Plus (Becker et al. 1988). S and S-Plus are commercial programs, and varia-
tions in the language appeared only between versions. R started as a poor man’s
home-brewed implementation of S, for use in teaching. Initially R, the program, im-
plemented a subset of the S language. The R program evolved until only relatively
few differences between S and R remained, and these differences are intentional—
thought of as significant improvements. As R overtook S-Plus in popularity, some
of the new features in R made their way back into S-Plus. R is free and open-source
and the name Gnu S is sometimes used to refer to R.
What makes R different from SPSS, SAS, etc., is that S was designed from the
start as a computer programming language. This may look unimportant for some-
one not actually needing or willing to write software for data analysis. However, in
reality it makes a huge difference because R is easily extensible. By this we mean
that new functionality can be easily added, and shared, and this new functional-
ity is to the user indistinguishable from that built into R. In other words, instead
of having to switch between different pieces of software to do different types of
analyses or plots, one can usually find an R package that will provide the tools to
do the job within R. For those routinely doing similar analyses the ability to write
a short program, sometimes just a handful of lines of code, allows automation of
routine analyses. For those willing to spend time programming, they have the door
open to building the tools they need when these do not already exist.
However, the most important advantage of using a language like R is that it
makes it easy to do data analyses in a way that ensures that they can be exactly
repeated. In other words, the biggest advantage of using R, as a language, is not
in communicating with the computer, but in communicating to other people what
has been done, in a way that is unambiguous. Of course, other people may want to
run the same commands on another computer, but it still means that translating a
set of instructions for the computer into text readable by humans—say, the
materials and methods section of a paper—and back is avoided, together with the
ambiguities that usually creep in.
1.2.2 R as a language
R is a computer language designed for data analysis and data visualization. How-
ever, in contrast to some other scripting languages, it is, from the point of view of
computer programming, a complete language—it is not missing any important fea-
ture. In other words, no fundamental operations or data types are lacking (Cham-
bers 2016). I attribute much of its success to the fact that its design achieves a
very good balance between simplicity, clarity and generality. R excels at generality
thanks to its extensibility at the cost of only a moderate loss of simplicity, while
clarity is ensured by enforced documentation of extensions and support for both
object-oriented and functional approaches to programming. The same three prin-
ciples can be also easily respected by user code written in R.
As mentioned above, R started as a free and open-source implementation of the
S language (Becker and Chambers 1984; Becker et al. 1988). We will describe the
features of the R language in later chapters. Here I mention, for those with pro-
gramming experience, that it does have some features that make it different from
other frequently used programming languages. For example, R does not have the
strict type checks of Pascal or C++. It has operators that can take vectors and ma-
trices as operands allowing more concise program statements for such operations
than other languages. Writing programs, especially reliable and fast code, requires
familiarity with some of these idiosyncrasies of the R language. For those using R
interactively, or writing short scripts, these idiosyncratic features make life a lot
easier by saving typing.
Some languages have been standardized, and their grammar has been
formally defined. R, in contrast is not standardized, and there is no formal
grammar definition. So, the R language is defined by the behavior of the R
program.
FIGURE 1.1
The R console where the user can type textual commands one by one. Here the user
has typed print("Hello") and entered it by ending the line of text by pressing the
“enter” key. The result of running the command is displayed below the command.
The character at the head of the input line, a “>” in this case, is called the command
prompt, signaling where a command can be typed in. Commands entered by the
user are displayed in red, while results returned by R are displayed in blue.
When we type in commands one by one, we say that we use R interactively. When
we run a script, we may say that we run a “batch job.”
The two approaches described above are part of the R program by itself. How-
ever, it is common to use a second program as a front-end or middleman between
the user and the R program. Such a program allows more flexibility and has multi-
ple features that make entering commands or writing scripts easier. Computations
are still done by exactly the same R program. The simplest option is to use a text
editor like Emacs to edit the scripts and then run the scripts in R from within the
editor. With some editors like Emacs, rather good integration is possible. However,
nowadays there are also Integrated Development Environments (IDEs) available for
R. An IDE both gives access to the R console in one window and provides a text
editor for writing scripts in another window. Of the available IDEs for R, RStudio
is currently the most popular by a wide margin.
FIGURE 1.2
The R console embedded in RStudio. The same commands have been typed in as
in Figure 1.1. Commands entered by the user are displayed in purple, while results
returned by R are displayed in black.
FIGURE 1.3
The R console after several commands have been entered. Commands entered by
the user are displayed in red, while results returned by R are displayed in blue.
Commands and the results they return form a dialogue between the user and R.
The console can look different when displayed within an IDE like RStudio, but the
only difference is in the appearance of the text rather than in the text itself (cf.
Figures 1.1 and 1.2).
The two previous figures showed the result of entering a single command. Fig-
ure 1.3 shows how the console looks after the user has entered several commands,
each as a separate line of text.
The examples in this book require only the console window for user input.
Menu-driven programs are not necessarily bad; they are just unsuitable when there
is a need to set very many options and choose from many different actions. They
are also difficult to maintain when extensibility is desired, and when independently
developed modules of very different characteristics need to be integrated. Textual
languages also have the advantage, to be addressed in later chapters, that com-
mand sequences can be stored in human- and computer-readable text files. Such
files constitute a record of all the steps used and, in most cases, make it trivial to
reproduce the same steps at a later time. Scripts are a very simple and handy way
of communicating to other users how to do a given data analysis.
In the console one types commands at the > prompt. When one ends a
line by pressing the return or enter key, if the line can be interpreted as an R
command, the result will be printed at the console, followed by a new > prompt.
FIGURE 1.4
Screen capture of the R console and editor just after running a script. The upper
pane shows the R console, and the lower pane, the script file in an editor.
When working at the command prompt, most results are printed by de-
fault. However, within scripts one needs to use function print() explicitly
when a result is to be displayed.
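As a minimal sketch of the difference, the same arbitrary expression is shown
below, first as automatically printed at the console, and then displayed with an
explicit call to print() as would be needed within a script.
1 + 1
## [1] 2
print(1 + 1) # explicit call, needed within a script
## [1] 2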
A true “batch job” is not run at the R console but at the operating system com-
mand prompt, or shell. The shell is the console of the operating system—Linux,
Unix, OS X, or MS-Windows. Figure 1.5 shows how running a script at the Windows
command prompt looks. A script can be run at the operating system prompt to
do time-consuming calculations with the output saved to a file. One may use this
approach on a server, say, to leave a large data analysis job running overnight or
even for several days.
FIGURE 1.5
Screen capture of the MS-Windows command console just after running the same
script. Here we use Rscript to run the script; the exact syntax will depend on the
operating system in use. In this case, R prints the results at the operating system
console or shell, rather than in its own R console.
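For example, on most systems a command like the one below, entered at the
operating system shell, runs a script non-interactively (the script file name used
here is hypothetical).
Rscript my-script.R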
FIGURE 1.6
The RStudio interface just after running the same script. Here we used the “Source”
button to run the script. In this case, R prints the results to the R console in the
lower left pane.
In RStudio, we ran the script by pressing the “Source” button at the top of the
editor pane. RStudio, in response to this, generated the code needed to source the
file and “entered” it at the console, the same console where we would type any R
commands.
When a script is run, if an error is triggered, RStudio automatically finds the
location of the error. RStudio supports the concept of projects allowing saving of
settings per project. Some features are beyond what you need for everyday data
analysis and aimed at package development, such as integration of debugging,
traceback on errors, profiling and benchmarking of code so as to analyze and
improve performance. It integrates support for file version control, which is not
only useful for package development, but also for keeping track of the progress
or collaboration in the analysis of data.
The version of RStudio that one uses locally, i.e., installed in a computer used
locally by a single user, runs with an almost identical user interface on most mod-
ern operating systems, such as Linux, Unix, OS X, and MS-Windows. There is also
a server version that runs on Linux, and that can be used remotely through a web
browser. The user interface is still the same.
RStudio is under active development and constantly improved. Visit
http://www.rstudio.org/ for an up-to-date description and download and installa-
tion instructions. Two books (Hillebrand and Nierhoff 2015; Loo and Jonge 2012)
describe and teach how to use RStudio without going in depth into data analysis
or statistics. However, as RStudio is under very active development, several recently
added important features are not described in these books. You will find tutorials
and up-to-date cheat sheets at http://www.rstudio.org/.
This book has been written using R, ‘knitr’ and LaTeX. All pages in the book are
generated directly, and all figures are generated by R and included automatically,
except for the figures in this chapter that have been manually captured from the
computer screen. Why am I using this approach? First, because I want to make sure
that every bit of code, exactly as you see it printed, runs without error. In addition,
I want to make sure that the output that you see below every line or chunk of R
language code is exactly what R returns. Furthermore, it saves a lot of work for me
as author, as I can just update R and all the packages used to their latest versions,
and rebuild the book, to keep it up to date and free of errors.
Although the use of these tools is important, they are outside the scope of this
book and well described in other books (Gandrud 2015; Xie 2013). Still, when writ-
ing code, using a consistent style for formatting and indentation, carefully choos-
ing variable names, and adding textual explanations in comments when needed,
helps very much with readability for humans. I have tried to be as consistent as
possible throughout the whole book in this respect, with only small personal de-
viations from the usual style.
Error messages tend to be terse in R, and may require some lateral think-
ing and/or “experimentation” to understand the real cause behind problems.
When you are not sure you understand how some command works, it is useful
in many cases to try simple examples for which you know the correct answer
and see if you can reproduce them with R. Because of this, this book includes
some code examples that trigger errors. Learning to interpret error messages
is part of what is needed to become a proficient user of R. To test your un-
derstanding of how a code statement or function works, it is good to try your
hand at testing its limits, testing which variations of a piece of code are valid
or not.
help("sum")
?sum
U Look at help for some other functions like mean(), var(), plot() and, why
not, help() itself!
help(help)
When using RStudio there are easier ways of navigating to a help page than
using function help(), for example, with the cursor on the name of a function in
the editor or console, pressing the F1 key opens the corresponding help page in the
help pane. Letting the cursor hover for a few seconds over the name of a function
at the R console will open “bubble help” for it. If the function is defined in a script
or another file that is open in the editor pane, one can directly navigate from the
line where the function is called to where it is defined. In RStudio one can also
search for help through the graphical interface.
In addition to help pages, R’s distribution includes useful manuals as PDF or
HTML files. These can be accessed most easily through the Help menu in RStudio
or RGUI. Extension packages provide help pages for the functions and data they
export. When a package is loaded into an R session, its help pages are added to
the native help of R. In addition to these individual help pages, each package pro-
vides an index of its corresponding help pages for users to browse. Many packages
contain vignettes, such as User Guides or articles describing the algorithms used.
There are some web sites that give access to R documentation through a web
server. These sites can be very convenient when exploring whether a certain pack-
age could be useful for a certain problem, as they allow browsing and searching
the documentation without need of installing the packages. Some package main-
tainers have web sites with additional documentation for their own packages. The
DESCRIPTION or README of packages provide contact information for the main-
tainer, links to web sites, and instructions on how to report bugs. As packages are
contributed by independent authors, they should be cited in addition to citing R
itself. R function citation(), when called with the name of a package as its argu-
ment, provides the reference that should be cited for the package, and without an
explicit argument, the reference to cite for the version of R in use.
citation()
## To cite R in publications use:
##
##   R Core Team (2020). R: A language and environment for statistical
##   computing. R Foundation for Statistical Computing, Vienna, Austria.
##   URL https://www.R-project.org/.
##
## A BibTeX entry for LaTeX users is
##
## @Manual{,
##   title = {R: A language and environment for statistical computing},
##   author = {{R Core Team}},
##   organization = {R Foundation for Statistical Computing},
##   address = {Vienna, Austria},
##   year = {2020},
##   url = {https://www.R-project.org/},
## }
##
## We have invested a lot of time and effort in creating R, please cite it
## when using it for data analysis. See also 'citation("pkgname")' for
## citing R packages.
U Look at the help page for function citation() for a discussion of why it
is important for users to cite R and packages when using them.
1.4.2.1 Netiquette
In most internet forums, a certain behavior is expected from those asking and
answering questions. Some types of misbehavior, like use of offensive or inap-
propriate language, will usually result in the user losing writing rights in a forum.
Occasional minor misbehavior will usually result in the original question not being
answered and, instead, the problem being highlighted in the reply. In general, following
the steps listed below will greatly increase your chances of getting a detailed and
useful answer.
• Do your homework: first search for existing answers to your question, both
online and in the documentation. (Do mention that you attempted this without
success when you post your question.)
• Provide a clear explanation of the problem, and all the relevant information.
Say whether it concerns R itself, and give the R version, operating system, and
any packages loaded and their versions.
• If at all possible, provide a simplified and short, but self-contained, code exam-
ple that reproduces the problem (sometimes called reprex).
• Be polite.
• Contribute to the forum by answering other users’ questions when you know
the answer.
1.4.2.2 StackOverflow
Nowadays, StackOverflow (http://stackoverflow.com/) is the best question-
and-answer (Q & A) support site for R. In most cases, searching for existing ques-
tions and their answers, will be all that you need to do. If asking a question, make
sure that it is really a new question. If there is some question that looks similar,
make clear how your question is different.
StackOverflow has a user-rights system based on reputation, and questions and
answers can be up- and down-voted. Those with the most up-votes are listed at the
top of searches. If the questions or answers you write are up-voted, after you ac-
cumulate enough reputation, you acquire badges and rights, such as editing other
users’ questions and answers or later on, even deleting wrong answers or off-topic
questions from the system. This sounds complicated, but works extremely well
at ensuring that the base of questions and answers is relevant and correct, with-
out relying on nominated moderators. When using StackOverflow, do contribute by
accepting correct answers, up-voting questions and answers that you find useful,
down-voting those you consider poor, and flagging or correcting errors you may
discover.
When preparing a reproducible example (reprex), use a data set included in base R
or generate artificial data within the reprex code.
If you can reproduce the problem only with your own data, then you need to
provide a minimal subset of it that triggers the problem.
While preparing the reprex you will need to simplify the code, and some-
times this step allows you to diagnose the problem. Always, before posting a
reprex online, it is wise to check it with the latest versions of R and any package
being used.
I would say that about two out of three times I prepare a reprex, it allows
me to much better understand the problem, find its root cause, and reach a
solution or a work-around on my own.
The book’s companion package contains installation instructions and saved lists of the names of all other
packages used in the book. Instructions on installing R, Git, RStudio, compilers
and other tools are available online. In many cases the IT staff at your employer or
school will know how to install them, or they may even be included in the default
computer setup. In addition, a web site supporting the book will be available at:
http://www.learnr-book.info.
Howard Aiken
Proposed automatic calculating machine, 1937; reprinted 1964
(3 + exp(2)) / sin(pi)
## [1] 8.483588e+16
It can be seen above that mathematical constants and functions are part of the
R language. One thing to remember when translating complex fractions as above
into R code is that in arithmetic expressions the bar of the fraction generates
a grouping that alters the normal precedence of operations. In contrast, in an R
expression this grouping must be explicitly signaled with additional parentheses.
If you are in doubt about how precedence rules work, you can add parentheses
to make sure the order of computations is the one you intend. Redundant paren-
theses have no effect.
1 + 2 * 3
## [1] 7
1 + (2 * 3)
## [1] 7
(1 + 2) * 3
## [1] 9
The number of opening (left side) and closing (right side) parentheses must be
balanced, and they must be located so that each enclosed term is a valid mathemat-
ical expression. For example, while (1 + 2) * 3 is valid, (1 +) 2 * 3 is a syntax
error as 1 + is incomplete and cannot be calculated.
U Here results are not shown. These are examples for you to type at the
command prompt. In general you should not skip them, as in many cases, as
with the statements highlighted with comments in the code chunk below, they
have something to teach or demonstrate. You are strongly encouraged to play,
in other words, create new variations of the examples and execute them to
explore how R works.
1 + 1
2 * 2
2 + 10 / 5
(2 + 10) / 5
10^2 + 1
sqrt(9)
log(100)
log10(100)
log2(8)
exp(1)
R is case sensitive, so variables a and A are two different variables. Variable names can be long in R al-
though it is not a good idea to use very long names. Here I am using very short
names, something that is usually also a very bad idea. However, in the examples
in this chapter where the stored values have no connection to the real world, sim-
ple names emphasize their abstract nature. In the chunk below, a and b are ar-
bitrarily chosen variable names; I could have used names like my.variable.a or
outside.temperature if they had been useful to convey information.
a <- 1
a + 1
## [1] 2
a
## [1] 1
b <- 10
b <- a + b
b
## [1] 11
3e-2 * 2.0
## [1] 0.06
Entering the name of a variable at the R console implicitly calls function print()
displaying the stored value on the console. The same applies to any other state-
ment entered at the R console: print() is implicitly called with the result of exe-
cuting the statement as its argument.
a
## [1] 1
print(a)
## [1] 1
a + 1
## [1] 2
print(a + 1)
## [1] 2
U There are some syntactically legal statements that are not very frequently
used, but you should be aware that they are valid, as they will not trigger error
messages, and may surprise you. The most important thing is to write code
consistently. The “backwards” assignment operator -> and resulting code like
1 -> a are valid but less frequently used. The use of the equals sign (=) for
assignment in place of <- although valid is discouraged. Chaining assignments
as in the first statement below can be used to signal to the human reader that
a, b and c are being assigned the same value.
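A minimal sketch of such statements follows; the values assigned are arbitrary.
a <- b <- c <- 1.0 # chained assignment: a, b and c all get the value 1.0
1 -> a # "backwards" assignment, equivalent to a <- 1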
In R, all numbers belong to mode numeric (we will discuss the concepts
of mode and class in section 2.8 on page 41). We can query if the mode of an
object is numeric with function is.numeric().
mode(1)
## [1] "numeric"
a <- 1
is.numeric(a)
## [1] TRUE
is.numeric(1L)
## [1] TRUE
is.integer(1L)
## [1] TRUE
is.double(1L)
## [1] FALSE
is.numeric(1)
## [1] TRUE
is.integer(1)
## [1] FALSE
is.double(1)
## [1] TRUE
The name double originates from the C language, in which there are dif-
ferent types of floats available, with the name double used to mean “double-
precision floating-point numbers.” Similarly, the use of L stems from the long
type in C, meaning “long integer numbers.”
Numeric variables can contain more than one value. Even single numbers are
in R vectors of length one. We will later see why this is important. As you have
seen above, the results of calculations were printed preceded with [1]. This is the
index or position in the vector of the first number (or other value) displayed at the
head of the current line.
One can use c() “concatenate” to create a vector from other vectors, including
vectors of length 1, such as the numeric constants in the statements below.
a <- c(3, 1, 2)
a
## [1] 3 1 2
b <- c(4, 5, 0)
b
## [1] 4 5 0
c <- c(a, b)
c
## [1] 3 1 2 4 5 0
d <- c(b, a)
d
## [1] 4 5 0 3 1 2
Function c() accepts as arguments two or more vectors and concatenates them,
one after another. Quite frequently we may need to insert one vector in the middle
of another. For this operation, c() is not useful by itself. One could use indexing
combined with c(), but this is not needed as R provides a function capable of
directly doing this operation. Although it can be used to “insert” values, it is named
append(), and by default, it indeed appends one vector at the end of another.
append(a, b)
## [1] 3 1 2 4 5 0
The output above is the same as for c(a, b), however, append() accepts as an
argument an index position after which to “append” its second argument. This
results in an insert operation when the index points at any position different from
the end of the vector.
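For example, assuming vectors a and b as defined above, passing an index position
as argument to parameter after results in an insertion.
append(a, values = b, after = 2L)
## [1] 3 1 4 5 0 2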
U One can create sequences using function seq() or the operator :, or repeat
values using function rep(). In this case, I leave to the reader to work out
the rules by running these and his/her own examples, with the help of the
documentation, available through help(seq) and help(rep).
a <- -1:5
a
b <- 5:-1
b
c <- seq(from = -1, to = 1, by = 0.1)
c
d <- rep(-5, 4)
d
Next, something that makes R different from most other programming lan-
guages: vectorized arithmetic. Operators and functions that are vectorized accept,
as arguments, vectors of arbitrary length, in which case the result returned is equiv-
alent to having applied the same function or operator individually to each element
of the vector.
(a + 1) * 2
## [1] 8 4 6
a + b
## [1] 7 6 2
a - a
## [1] 0 0 0
As can be seen in the first line above, another peculiarity of R is what is fre-
quently called “recycling” of arguments: as vector a is of length 3, but the constant
1 is a vector of length 1, this short constant vector is extended, by recycling its
value, into a vector of three ones—i.e., a vector of the same length as the longest
vector in the statement, a.
Make sure you understand what calculations are taking place in the chunk
above, and also the one below.
a <- rep(1, 6)
a
## [1] 1 1 1 1 1 1
a + 1:2
## [1] 2 3 2 3 2 3
a + 1:3
## [1] 2 3 4 2 3 4
a + 1:4
A useful thing to know: a vector can have length zero. Vectors of length
zero may seem at first sight quite useless, but in fact they are very useful. They
allow the handling of “no input” or “nothing to do” cases as normal cases,
which in the absence of vectors of length zero would need to be treated as
special cases. I describe here a useful function, length(), which returns the
length of a vector or list.
z <- numeric(0)
z
## numeric(0)
length(z)
## [1] 0
length(c(a, b))
## [1] 9
Many functions, such as R’s maths functions and operators, will accept nu-
meric vectors of length zero as valid input, returning also a vector of length
zero, issuing neither a warning nor an error message. In other words, these are
valid operations in R.
log(numeric(0))
## numeric(0)
5 + numeric(0)
## numeric(0)
It is possible to remove variables from the workspace with rm(). Function ls()
returns the names of all objects visible in the current environment, or, by supplying
a pattern argument, only the names of the objects matching the pattern. The pattern
is given as a regular expression, with [] enclosing alternative matching characters,
^ and $, indicating the extremes of the name (start and end, respectively). For
example, "^z$" matches only the single character ‘z’ while "^z" matches any name
starting with ‘z’. In contrast "^[zy]$" matches both ‘z’ and ‘y’ but neither ‘zy’ nor
‘yz’, and "^[a-z]" matches any name starting with a lowercase ASCII letter. If you
are using RStudio, all objects are listed in the Environment pane, and the search
box of the panel can be used to find a given object.
ls(pattern="^z$")
## [1] "z"
rm(z)
ls(pattern="^z$")
## character(0)
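As a further sketch, assuming objects a and b exist in the workspace as in the
examples above, a pattern with alternative characters matches both names.
ls(pattern = "^[ab]$")
## [1] "a" "b"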
There are some special values available for numbers. NA meaning “not available”
is used for missing values. Calculations can also yield the following values: NaN (“not
a number”), and Inf and -Inf for ∞ and −∞. As you will see below, calculations yielding
these values do not trigger errors or warnings, as they are arithmetically valid. Inf
and -Inf are also valid numerical values for input and constants.
a <- NA
a
## [1] NA
-1 / 0
## [1] -Inf
1 / 0
## [1] Inf
Inf / Inf
## [1] NaN
Inf + 4
## [1] Inf
b <- -Inf
b * -1
## [1] Inf
Not available (NA) values are very important in the analysis of experimental data,
as frequently some observations are missing from an otherwise complete data set
due to “accidents” during the course of an experiment. It is important to under-
stand how to interpret NA’s. They are simple placeholders for something that is
unavailable, in other words, unknown.
A <- NA
A
## [1] NA
A + 1
## [1] NA
A + Inf
## [1] NA
U When to use vectors of length zero, and when NAs? Make sure you un-
derstand the logic behind the different behavior of functions and operators
with respect to NA and numeric() or its equivalent numeric(0). What do they
represent? Why are NAs not ignored, while vectors of length zero are?
123 + numeric()
123 + NA
Model answer: NA is used to signal a value that “was lost” or “was expected”
but is unavailable because of some accident. A vector of length zero repre-
sents no values, but within the normal expectations. In particular, if vectors
are expected to have a certain length, or if index positions along a vector are
meaningful, then using NA is a must.
Any operation, even tests of equality, involving one or more NA’s return an NA.
In other words, when one input to a calculation is unknown, the result of the cal-
culation is unknown. This means that a special function is needed for testing for
the presence of NA values.
is.na(c(NA, 1))
## [1] TRUE FALSE
In the example above, we can also see that is.na() is vectorized, and that it
applies the test to each of the two elements of the vector individually, returning
the result as a logical vector of length two.
One thing to be aware of are the consequences of the fact that numbers in
computers are almost always stored with finite precision and/or range: the expec-
tations derived from the mathematical definition of Real numbers are not always
fulfilled. See the box on page 33 for an in-depth explanation.
1 - 1e-20
## [1] 1
When comparing integer values these problems do not exist, as integer arith-
metic is not affected by loss of precision in calculations restricted to integers.
Because of the way integers are stored in the memory of computers, within the
representable range, they are stored exactly. One can think of computer integers
as a subset of whole numbers restricted to a certain range of values.
1L + 3L
## [1] 4
1L * 3L
## [1] 3
1L %/% 3L
## [1] 0
1L %% 3L
## [1] 1
1L / 3L
## [1] 0.3333333
The last statement in the example immediately above, using the “usual” division
operator, yields a floating-point double result, while the integer division operator
%/% yields an integer result, and %% returns the remainder from the integer divi-
sion. If as a result of an operation the result falls outside the range of representable
values, the returned value is NA.
1000000L * 1000000L
Both doubles and integers are considered numeric. In most situations, conver-
sion is automatic and we do not need to worry about the differences between these
two types of numeric values. The next chunk shows returned values that are either
TRUE or FALSE. These are logical values that will be discussed in the next section.
is.numeric(1L)
## [1] TRUE
is.integer(1L)
## [1] TRUE
is.double(1L)
## [1] FALSE
is.double(1L / 3L)
## [1] TRUE
is.numeric(1L / 3L)
## [1] TRUE
U Study the variations of the previous example shown below, and ex-
plain why the two statements return different values. Hint: 1 is a double con-
stant. You can use is.integer() and is.double() in your explorations.
1 * 1000000L * 1000000L
1000000L * 1000000L * 1
round(0.0124567, digits = 3)
## [1] 0.012
signif(0.0124567, digits = 3)
## [1] 0.0125
round(1789.1234, digits = 3)
## [1] 1789.123
signif(1789.1234, digits = 3)
## [1] 1790
a <- 0.12345
b <- round(a, digits = 2)
a == b
## [1] FALSE
a - b
## [1] 0.00345
b
## [1] 0.12
As digits is the second parameter of these functions, the argument can
also be passed by position. However, code is usually easier to understand for
humans when parameter names are made explicit.
round(0.0124567, digits = 3)
## [1] 0.012
round(0.0124567, 3)
## [1] 0.012
• Explore how trunc() and ceiling() differ. Test them both with positive and
negative values.
• Advanced Use function abs() and operators + and - to reproduce the out-
put of trunc() and ceiling() for the different inputs.
a <- TRUE
b <- FALSE
a
## [1] TRUE
!a # negation
## [1] FALSE
a || b # logical OR
## [1] TRUE
xor(a, b) # exclusive OR
## [1] TRUE
a <- c(TRUE, FALSE)
b <- c(TRUE, TRUE)
a | b # vectorized OR
## [1] TRUE TRUE
a || b # not vectorized
## [1] TRUE
Functions any() and all() take zero or more logical vectors as their arguments,
and return a single logical value “summarizing” the logical values in the vectors.
Function all() returns TRUE only if all values in the vectors passed as arguments
are TRUE, and any() returns TRUE unless all values in the vectors are FALSE.
any(a)
## [1] TRUE
all(a)
## [1] FALSE
any(a & b)
## [1] TRUE
all(a & b)
## [1] FALSE
Another important thing to know about logical operators is that they “short-
cut” evaluation. If the result is known from the first part of the statement, the rest
of the statement is not evaluated. Try to understand what happens when you enter
the following commands. Short-cut evaluation is useful, as the first condition can
be used as a guard protecting a later condition from being evaluated when it would
trigger an error.
TRUE || NA
## [1] TRUE
FALSE || NA
## [1] NA
TRUE && NA
## [1] NA
FALSE && NA
## [1] FALSE
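A minimal sketch of such a guard follows; object x and its NULL value are hypo-
thetical. The first condition evaluates to FALSE, so the second condition, which by
itself would return a value unusable as a condition, is never evaluated.
x <- NULL
!is.null(x) && x > 0
## [1] FALSE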
When using the vectorized operators on vectors of length greater than one,
‘short-cut’ evaluation still applies for the result obtained at each index position.
a & b & NA
## [1] NA FALSE
a | b | c(NA, NA)
## [1] TRUE TRUE
x & FALSE
x | c(TRUE, FALSE)
1.2 != 1.0
## [1] TRUE
a <- 20
a < 100 && a > 10
## [1] TRUE
a <- 1:10
a > 5
## [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
a < 5
## [1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
a == 5
## [1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
all(a > 5)
## [1] FALSE
any(a > 5)
## [1] TRUE
b <- a > 5
b
## [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
any(b)
## [1] TRUE
all(b)
## [1] FALSE
Precedence rules also apply to comparison operators and they can be overrid-
den by means of parentheses.
a > 2 + 3
## [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
(a > 2) + 3
## [1] 3 3 4 4 4 4 4 4 4 4
a <- 1:10
a > 3 | a + 2 < 3
Again, be aware of “short-cut evaluation”. If the result does not depend on the
missing value, then the result, TRUE or FALSE, is returned. If the presence of the NA
makes the end result unknown, then NA is returned.
c <- c(a, NA) # a vector including an NA
all(c > 5)
## [1] FALSE
any(c > 5)
## [1] TRUE
all(c < 20)
## [1] NA
any(c > 20)
## [1] NA
is.na(a)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
is.na(c)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
any(is.na(c))
## [1] TRUE
all(is.na(c))
## [1] FALSE
The behavior of many of base-R’s functions when NAs are present in their in-
put arguments can be modified. Passing TRUE as an argument to parameter na.rm
results in NA values being removed from the input before the function is applied.
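A minimal sketch with function mean(); the numeric values are arbitrary.
mean(c(1, NA, 3))
## [1] NA
mean(c(1, NA, 3), na.rm = TRUE)
## [1] 2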
Here I give some examples for which the finite resolution of com-
puter machine floats, as compared to Real numbers as defined in mathematics,
can cause serious problems. In R, numbers that are not integers are stored as
double-precision floats. In addition to having limits to the largest and small-
est numbers that can be represented, the precision of floats is limited by the
number of significant digits that can be stored. Precision is usually described
by “epsilon” (𝜖), abbreviated eps, defined as the smallest value of 𝜖 for which
1 + 𝜖 ≠ 1. The finite resolution of floats can lead to unexpected results when
testing for equality. In the second example below, the result of the subtraction
is still exactly 1 due to insufficient resolution.
0 - 1e-20
## [1] -1e-20
1 - 1e-20
## [1] 1
The finiteness of floats also affects tests of equality, which is more likely to
result in errors with important consequences.
1e20 == 1 + 1e20
## [1] TRUE
1 == 1 + 1e-20
## [1] TRUE
0 == 1e-20
## [1] FALSE
.Machine$double.eps
## [1] 2.220446e-16
.Machine$double.neg.eps
## [1] 1.110223e-16
.Machine$double.max.exp
## [1] 1024
.Machine$double.min.exp
## [1] -1022
The last two values are the largest and smallest exponents of 2 (the radix), rather
than the maximum and minimum size of numbers that can be handled as objects
of class double. Overflow results in Inf or -Inf, underflow results in zero, and
infinite values enter arithmetic according to the mathematical rules.
1e1026
## [1] Inf
1e-1026
## [1] 0
Inf + 1
## [1] Inf
-Inf + 1
## [1] -Inf
.Machine$integer.max
## [1] 2147483647
2147483699L
## [1] 2147483699
In those statements in the chunk below where at least one operand is double,
the integer operands are promoted to double before computation. A similar
promotion does not take place when operations are among integer values,
resulting in overflow, meaning numbers that are too big to be represented as
integer values.
2147483600L + 99L
2147483600L + 99
## [1] 2147483699
2147483600L * 2147483600L
2147483600L * 2147483600
## [1] 4.611686e+18
We see next that the exponentiation operator ^ forces the promotion of its
arguments to double, resulting in no overflow. In contrast, as seen above, the
multiplication operator * operates on integers resulting in overflow.
2147483600L * 2147483600L
2147483600L^2L
## [1] 4.611686e+18
In many situations, when writing programs one should avoid testing for
equality of floating point numbers (‘floats’). Here we show how to gracefully
handle rounding errors. As the example shows, rounding errors may accumu-
late, and in practice .Machine$double.eps is not always a good value to safely
use in tests for “zero,” and a larger value may be needed. Whenever possible
according to the logic of the calculations, it is best to test for inequalities, for
example using x <= 1.0 instead of x == 1.0. If this is not possible, then the
tests should be done replacing tests like x == 1.0 with abs(x - 1.0) < eps.
Function abs() returns the absolute value, in simpler words, makes all values
positive or zero, by changing the sign of negative values, or in mathematical
notation |𝑥| = | − 𝑥|.
sin(pi)
## [1] 1.224606e-16
sin(2 * pi)
## [1] -2.449213e-16
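A minimal sketch of such a test; the tolerance eps is an arbitrary choice for
illustration.
eps <- 1e-10
abs(sin(pi) - 0) < eps
## [1] TRUE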
intersect(fruits, shopping)
intersect(bakery, shopping)
## [1] "bread"
intersect(dairy, shopping)
## [1] "butter" "cheese"
To test if a given value belongs to a set, we use operator %in%. In the algebra of
sets notation, this is written 𝑎 ∈ 𝐴, where 𝐴 is a set and 𝑎 a member. The second
statement shows that the %in% operator is vectorized on its left-hand-side (lhs)
operand, returning a logical vector.
Although inclusion is a set operation, it is also very useful for simplifying
if()…else statements, replacing multiple tests for alternative constant values
of the same mode chained by multiple | operators, as shown below.
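A minimal sketch of such a simplification; object x and the member values are
hypothetical, and the two statements are equivalent.
x <- "two"
x == "one" | x == "two" | x == "three"
## [1] TRUE
x %in% c("one", "two", "three")
## [1] TRUE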
my.set <- c("a", "a", "b", "c") # hypothetical; any vector with these unique values fits
unique(my.set)
## [1] "a" "b" "c"
In the notation used in algebra of sets, the set union operator is ∪ while the
intersection operator is ∩. If we have sets 𝐴 and 𝐵, their union is given by 𝐴∪𝐵—in
the next three examples, c("a", "a", "z") is a constant, while my.set is a variable.
U What do you expect to be the difference between the values returned by the
three statements in the code chunk below? Before running them, write down
your expectations about the value each one will return. Only then run the code.
Independently of whether your predictions were correct or not, write down an
explanation of what each statement’s operation is.
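A possible chunk with three such statements, assuming my.set as defined above
(a sketch; the original statements may have differed):
union(my.set, c("a", "a", "z"))
c(my.set, c("a", "a", "z"))
unique(c(my.set, c("a", "a", "z")))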
In the algebra of sets notation 𝐴 ⊆ 𝐵, where 𝐴 and 𝐵 are sets, indicates that
𝐴 is a subset or equal to 𝐵. For a true subset, the notation is 𝐴 ⊂ 𝐵. The opera-
tors with the reverse direction are ⊇ and ⊃. Implement these four operations
in four R statements, and test them on sets (represented by R vectors) with
different “overlap” among set members.
All set algebra examples above use character vectors and character con-
stants. This is just the most frequent use case. Set operations are valid on
vectors of any atomic class, including integer, and computed values can be
part of statements. In the second and third statements in the next chunk, we
need to use additional parentheses to alter the default order of precedence
between arithmetic and set operators.
9L %in% 2L:4L
## [1] FALSE
Empty sets are an important component of the algebra of sets, in R they are
represented as vectors of zero length. Vectors and lists of zero length, which
the R language fully supports, can be used to “encode” emptiness also in other
contexts. These vectors do belong to a class such as numeric or character and
must be compatible with other operands in an expression. By default, construc-
tors for vectors construct empty vectors.
length(integer())
## [1] 0
1L %in% integer()
## [1] FALSE
Although set operators are defined for numeric vectors, rounding errors in
‘floats’ can result in unexpected results (see section 2.5 on page 33). The next
two examples do, however, return the correct answers.
9 %in% (2:4)^2
## [1] TRUE
a <- "A"
## [1] "A"
b <- 'A'
## [1] "A"
a == b
## [1] TRUE
Concatenating character vectors of length one does not yield a longer character
string, it yields instead a longer vector.
a <- 'A'
b <- "bcdefg"
c <- "123"
d <- c(a, b, c)
d
## [1] "A" "bcdefg" "123"
Having two different delimiters available makes it possible to choose the type
of quotes used as delimiters so that other quotes can be included in a string.
a <- "He said 'hello' when he came in"
The outer quotes are not part of the string; they are “delimiters” used to mark
the boundaries. As can be seen below when a string is printed, special characters
can be represented using “escape sequences”. There are several of them, and here
we will show just four: new line (\n) and tab (\t), \" the escape code for a quotation
mark within a string, and \\ the escape code for a single backslash \. We also show here
the different behavior of print() and cat(), with cat() interpreting the escape
sequences and print() displaying them as entered.
c <- "abc\ndef\tx\"yz\"\\\tm"
print(c)
## [1] "abc\ndef\tx\"yz\"\\\tm"
cat(c)
## abc
## def x"yz"\ m
The escape codes work only in some contexts, as when using cat() to generate
the output. For example, the new-line escape (\n) can be embedded in strings used
for axis labels, titles or other labels in a plot, to split them over two or more lines.
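A minimal sketch (the label text is an example):
plot(1:5, xlab = "speed\n(m/s)")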
my_var <- 1:5 # an integer vector (definition assumed; outputs below made consistent with it)
typeof(my_var)
## [1] "integer"
is.double(my_var)
## [1] FALSE
is.integer(my_var)
## [1] TRUE
is.logical(my_var)
## [1] FALSE
is.character(my_var)
## [1] FALSE
class(my_var)
## [1] "integer"
inherits(my_var, "character")
## [1] FALSE
inherits(my_var, "numeric")
## [1] FALSE
as.character(1)
## [1] "1"
as.numeric("1")
## [1] 1
as.logical("TRUE")
## [1] TRUE
as.logical("NA")
## [1] NA
1 || 0
## [1] TRUE
FALSE | -2:2
## [1] TRUE TRUE FALSE TRUE TRUE
1 Except for some packages in the 'tidyverse' that use names starting with as_ instead of as..
as.character(3.0e10)
as.numeric("5E+5")
as.numeric("A")
as.numeric(TRUE)
as.numeric(FALSE)
as.logical("T")
as.logical("t")
as.logical("true")
as.logical(100)
as.logical(0)
as.logical(-1)
f <- c("1", "2", "3") # definition assumed from the outputs below
length(f)
## [1] 3
g <- "123"
length(g)
## [1] 1
as.numeric(f)
## [1] 1 2 3
as.numeric(g)
## [1] 123
Other functions relevant to the “conversion” of numbers and other values are
format(), and sprintf(). These two functions return character strings, instead of
numeric or other values, and are useful for printing output. One could think of
these functions as advanced conversion functions returning formatted, and pos-
sibly combined and annotated, character strings. However, they are usually not
considered normal conversion functions, as they are very rarely used in a way
that preserves the original precision of the input values. We show here the use of
format() and sprintf() with numeric values, but they can also be used with values
of other modes.
When using format(), the format used to display numbers is set by passing ar-
guments to several different parameters. As print() calls format() to make num-
bers pretty, it accepts the same options.
x <- c(123.4567890, 1.0)
format(x) # using defaults
## [1] "123.4568" " 1.0000"
Function sprintf() is similar to C’s function of the same name. The user in-
terface is rather unusual, but very powerful, once one learns the syntax. All the
formatting is specified using a character string as template. In this template, place-
holders for data and the formatting instructions are embedded using special codes.
These codes start with a percent character. We show in the example below the use
of some of these: f is used for numeric values to be formatted according to a “fixed
point,” while g is used when we set the number of significant digits and e for ex-
ponential or scientific notation.
x <- c(123.4567890, 1.0)
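A call consistent with the template discussed below (a sketch; the original chunk's output is reconstructed):
sprintf("The numbers are: %4.2f and %.0f", x[1], x[2])
## [1] "The numbers are: 123.46 and 1"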
In the template "The numbers are: %4.2f and %.0f", there are two placehold-
ers for numeric values, %4.2f and %.0f, so in addition to the template, we pass two
values extracted from the first two positions of vector x. These could have been
two different vectors of length one, or even numeric constants. The template itself
does not need to be a character constant as in these examples, as a variable can
be also passed as argument.
U Look up the help pages of format() and sprintf(), and practice by trying to
create the same formatted output by means of the two functions. Do also play
with these functions with other types of data like integer and character.
is.numeric(NA)
## [1] FALSE
is.character(NA)
## [1] FALSE
class(NA)
## [1] "logical"
a <- letters[1:10]
a
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
a[2]
## [1] "b"
a[c(3,2)]
## [1] "c" "b"
a[10:1]
## [1] "j" "i" "h" "g" "f" "e" "d" "c" "b" "a"
U The length of the indexing vector is not restricted by the length of the
indexed vector. However, only numerical indexes that match positions present
in the indexed vector can extract values. Values in the indexing vector
pointing to positions not present in the indexed vector result in NAs.
This is easier to learn by playing with R, than from explanations. Play with R,
using the following examples as a starting point.
length(a)
a[c(3,3,3,3)]
a[c(10:1, 1:10)]
a[c(1,11)]
a[11]
Have you tried some of your own examples? If not yet, do play with addi-
tional variations of your own before continuing.
Negative indexes have a special meaning; they indicate the positions at which
values should be excluded. Be aware that it is illegal to mix positive and negative
values in the same indexing operation.
a[-2]
## [1] "a" "c" "d" "e" "f" "g" "h" "i" "j"
a[-c(3,2)]
## [1] "a" "d" "e" "f" "g" "h" "i" "j"
a[-3:-2]
## [1] "a" "d" "e" "f" "g" "h" "i" "j"
U Results from indexing with special values and zero may be surprising.
Try to build a rule from the examples below that will help you remember
what to expect next time you are confronted with similar statements using
"subscripts" that are special values instead of integers larger than or equal to
one. This is likely to happen sooner or later, as these special values can be
returned by different R expressions depending on the value of operands or
function arguments, some of them described earlier in this chapter.
a[ ]
a[0]
a[numeric(0)]
a[NA]
a[c(1, NA)]
a[NULL]
a[c(1, NULL)]
Another way of indexing, which is very handy, but not available in most other
programming languages, is indexing with a vector of logical values. The logical
vector used for indexing is usually of the same length as the vector from which
elements are going to be selected. However, this is not a requirement, because if
the logical vector of indexes is shorter than the indexed vector, it is “recycled” as
discussed above in relation to other operators.
a[TRUE]
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
a[FALSE]
## character(0)
a[c(TRUE, FALSE)]
## [1] "a" "c" "e" "g" "i"
a[c(FALSE, TRUE)]
## [1] "b" "d" "f" "h" "j"
a > "c"
## [1] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
U The examples in this text box demonstrate additional uses of logical vec-
tors: 1) the logical vector returned by a vectorized comparison can be stored
in a variable, and the variable used as a "selector" for extracting a subset of
values from the same vector, or from a different vector; 2) function which()
converts such a logical vector into a vector of the indexes at which TRUE
values are located, usable in the same way.
a <- letters[1:10]
b <- 1:10
selector <- a > "c" # definition assumed from the text above
selector
a[selector]
b[selector]
indexes <- which(a > "c") # definition assumed
indexes
a[indexes]
b[indexes]
Make sure you understand the examples above. These constructs are very
widely used in R because they allow concise code that is easy to understand
once you are familiar with the indexing rules. However, until you master these
rules, many such terse statements will be unintelligible to you.
a <- 1:10
a
## [1] 1 2 3 4 5 6 7 8 9 10
a[1] <- 99
a
## [1] 99 2 3 4 5 6 7 8 9 10
b <- 1:10
b[c(2,4)] <- -99 # recycling
b
## [1] 1 -99 3 -99 5 6 7 8 9 10
c <- 1:10
c[c(2,4)] <- c(-99, 99)
c
## [1] 1 -99 3 99 5 6 7 8 9 10
d <- 1:10
d[TRUE] <- 1 # recycling
d
## [1] 1 1 1 1 1 1 1 1 1 1
e <- 1:10
e <- 1 # no recycling
e
## [1] 1
We can also use subscripting on both sides of the assignment operator, for
example, to swap two elements.
a <- letters[1:10]
a[1:2] <- a[2:1] # swap the first two elements (statement assumed)
a
## [1] "b" "a" "c" "d" "e" "f" "g" "h" "i" "j"
U Do play with subscripts to your heart's content; really grasping how they
work and how they can be used will be very useful in anything you do in the
future with R. Even the contrived example below follows the same simple rules,
just study it bit by bit. Hint: the second statement in the chunk below modifies
a, so when studying variations of this example you will need to recreate a
by executing the first statement each time you run a variation of the second
statement.
a <- letters[1:10]
a[5:1] <- a[c(TRUE,FALSE)]
a
## [1] "i" "g" "e" "c" "a" "f" "g" "h" "i" "j"
b <- LETTERS[1:10]
b[1]
## [1] "A"
b[1.1]
## [1] "A"
b[1.9999] # surprise!!
## [1] "A"
b[2]
## [1] "B"
From this experiment, we can learn that if positive indexes are not whole
numbers, they are truncated to the next smaller integer.
b <- LETTERS[1:10]
b[-1]
## [1] "B" "C" "D" "E" "F" "G" "H" "I" "J"
b[-1.1]
## [1] "B" "C" "D" "E" "F" "G" "H" "I" "J"
b[-1.9999]
## [1] "B" "C" "D" "E" "F" "G" "H" "I" "J"
b[-2]
## [1] "A" "C" "D" "E" "F" "G" "H" "I" "J"
From this experiment, we can learn that if negative indexes are not whole
numbers, they are truncated to the next larger (less negative) integer. In con-
clusion, double index values behave as if they were sanitized using function
trunc().
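We can check this directly (a quick sketch):
trunc(1.9999)
## [1] 1
trunc(-1.9999)
## [1] -1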
This example also shows how one can tease out R's rules through experimentation.
my.vector <- c(10, 4, 22, 1, 4) # definition assumed from the outputs below
sort(my.vector)
## [1] 1 4 4 10 22
sort(my.vector, decreasing = TRUE)
## [1] 22 10 4 4 1
order(my.vector)
## [1] 4 2 5 1 3
my.vector[order(my.vector)]
## [1] 1 4 4 10 22
another.vector <- c("ab", "aa", "zy", "e", "a") # any equal-length vector (definition assumed)
another.vector[order(my.vector)]
## [1] "e" "aa" "a" "ab" "zy"
A problem linked to sorting that we may face is counting how many copies
of each value are present in a vector. We can combine two functions, sort()
and rle(). The second of these computes run lengths, as used in run-length
encoding, for which rle is an abbreviation. A run is a series of consecutive
identical values. As the objective is to count the number of copies of each value
present, we need first to sort the vector.
my.letters <- c("a", "e", "j", "c", "a", "d", "u", "a", "j")
my.letters
## [1] "a" "e" "j" "c" "a" "d" "u" "a" "j"
sort(my.letters)
## [1] "a" "a" "a" "c" "d" "e" "j" "j" "u"
rle(sort(my.letters))
## Run Length Encoding
## lengths: int [1:6] 3 1 1 1 2 1
## values : chr [1:6] "a" "c" "d" "e" "j" "u"
The second and third statements are only to demonstrate the effect of each
step. The last statement uses nested function calls to compute the number of
copies of each value in the vector.
matrix(1:15, ncol = 3)
## [,1] [,2] [,3]
## [1,] 1 6 11
## [2,] 2 7 12
## [3,] 3 8 13
## [4,] 4 9 14
## [5,] 5 10 15
matrix(1:15, nrow = 3)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 4 7 10 13
## [2,] 2 5 8 11 14
## [3,] 3 6 9 12 15
U Check in the help page for the matrix constructor how to use the byrow
parameter to alter the default order in which the elements of the vector are
allocated to columns and rows of the new matrix.
help(matrix)
While you are looking at the help page, also consider the default number of
columns and rows.
matrix(1:15)
To start getting a sense of how to interpret error and warning mes-
sages, run the code below and make sure you understand which problem is
being reported. Before executing the statement, analyze it and predict what
the returned value will be. Afterwards, compare your prediction to the value
actually returned.
matrix(1:15, ncol = 2)
Subscripting of matrices and arrays is consistent with that used for vectors; we
only need to supply an indexing vector, or leave a blank space, for each dimension.
A matrix has two dimensions, so to access any element or group of elements, we
use two indices. The only complication is that there are two possible orders in
which, in principle, indexes could be supplied. In R, indexes for matrices are written
“row first.” In simpler words, the first index value selects rows, and the second one,
columns.
A <- matrix(1:20, ncol = 4) # definition assumed; it matches the printed matrix
A
## [,1] [,2] [,3] [,4]
## [1,] 1 6 11 16
## [2,] 2 7 12 17
## [3,] 3 8 13 18
## [4,] 4 9 14 19
## [5,] 5 10 15 20
A[1, 1]
## [1] 1
A[1, ]
## [1] 1 6 11 16
A[ , 1]
## [1] 1 2 3 4 5
A[2:3, c(1,3)]
## [,1] [,2]
## [1,] 2 12
## [2,] 3 13
A[3, 4] <- 99
A
## [,1] [,2] [,3] [,4]
## [1,] 1 6 11 16
## [2,] 2 7 12 17
## [3,] 3 8 13 99
## [4,] 4 9 14 19
## [5,] 5 10 15 20
A[3:4, 1:2] <- A[4:3, 2:1] # assignment assumed; it reproduces the printout below
A
## [,1] [,2] [,3] [,4]
## [1,] 1 6 11 16
## [2,] 2 7 12 17
## [3,] 9 4 13 99
## [4,] 8 3 14 19
## [5,] 5 10 15 20
one.col.matrix <- matrix(1:6, ncol = 1) # definitions assumed from the outputs below
two.col.matrix <- matrix(1:6, ncol = 2)
one.elem.matrix <- matrix(1)
no.elem.matrix <- matrix(numeric(0), nrow = 0, ncol = 0)
dim(one.col.matrix)
## [1] 6 1
dim(two.col.matrix)
## [1] 3 2
dim(one.elem.matrix)
## [1] 1 1
dim(no.elem.matrix)
## [1] 0 0
Arrays are similar to matrices, but can have more than two dimensions, which
are specified with the dim argument to the array() constructor.
B <- array(1:27, dim = c(3, 3, 3))
B
## , , 1
##
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 10 13 16
## [2,] 11 14 17
## [3,] 12 15 18
##
## , , 3
##
## [,1] [,2] [,3]
## [1,] 19 22 25
## [2,] 20 23 26
## [3,] 21 24 27
B[2, 2, 2]
## [1] 14
In the chunk above, the length of the supplied vector is the product of the
dimensions, 27 = 3 × 3 × 3.
U How do you use indexes to extract the second element of the original
vector, in each of the following matrices and arrays?
v <- 1:10
a2c <- array(v, dim = c(5, 2), dimnames = list(NULL, c("c1", "c2")))
Be aware that vectors and one-dimensional arrays are not the same thing,
while two-dimensional arrays are matrices.
1. Use the different constructors and query methods to explore this, and
its consequences.
2. Convert a matrix into a vector using unlist() and as.vector() and
compare the returned values.
Operators for matrices are available in R, as matrices are used in many statistical
algorithms. We will not describe them all here, only t() and some specializations
of arithmetic operators. Function t() transposes a matrix, by swapping columns
and rows.
A <- matrix(1:20, ncol = 4) # recreate A unmodified (statement assumed)
A
## [,1] [,2] [,3] [,4]
## [1,] 1 6 11 16
## [2,] 2 7 12 17
## [3,] 3 8 13 18
## [4,] 4 9 14 19
## [5,] 5 10 15 20
t(A)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 2 3 4 5
## [2,] 6 7 8 9 10
## [3,] 11 12 13 14 15
## [4,] 16 17 18 19 20
A + 2
## [,1] [,2] [,3] [,4]
## [1,] 3 8 13 18
## [2,] 4 9 14 19
## [3,] 5 10 15 20
## [4,] 6 11 16 21
## [5,] 7 12 17 22
A * 0:1
## [,1] [,2] [,3] [,4]
## [1,] 0 6 0 16
## [2,] 2 0 12 0
## [3,] 0 8 0 18
## [4,] 4 0 14 0
## [5,] 0 10 0 20
A * 1:0
## [,1] [,2] [,3] [,4]
## [1,] 1 0 11 0
## [2,] 0 7 0 17
## [3,] 3 0 13 0
## [4,] 0 9 0 19
## [5,] 5 0 15 0
In the examples above with the usual multiplication operator *, the operation
performed is not a matrix product, but instead the products between individual
elements of the matrix and of the recycled vector. Matrix multiplication is
indicated by the operator %*%.
B <- matrix(1:16, ncol = 4) # definition assumed; it reproduces the product below
B %*% B
## [,1] [,2] [,3] [,4]
## [1,] 90 202 314 426
## [2,] 100 228 356 484
## [3,] 110 254 398 542
## [4,] 120 280 440 600
2.12 Factors
Factors are used to indicate categories, most frequently the factors describing the
treatments in an experiment, or categories in a survey. They can be created either
from numerical or character vectors. The different possible values are called levels.
Normal factors created with factor() are unordered or categorical. R also supports
ordered factors that can be created with function ordered().
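Minimal sketches of the two constructors (the values are examples):
factor(c("treated", "control", "control"))
## [1] treated control control
## Levels: control treated
ordered(c("low", "high", "low"), levels = c("low", "high"))
## [1] low high low
## Levels: low < high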
The labels (“names”) of the levels can be set when the factor is created. In this
case, when calling factor(), parameters levels and labels should both be passed
a vector as argument, with levels and matching labels in the same position in the
two vectors. The argument passed to levels determines the order of the levels
based on their old names or values, and the argument passed to labels gives new
names to the levels.
my.vector <- c(0, 1, 0, 1, 1) # example values (definition assumed)
my.factor <- factor(x = my.vector, levels = c(1, 0), labels = c("treated", "control"))
my.factor
## [1] control treated control treated treated
## Levels: treated control
levels(my.factor)
## [1] "treated" "control"
It can be seen above that subsetting does not drop unused factor levels, and
that factor() can be used to explicitly drop the unused factor levels.
Converting factors into numbers is not intuitive, even in the case where a factor
was created from a numeric vector.
my.factor2 <- factor(rep(3:5, 4)) # definition assumed from the outputs below
as.numeric(my.factor2)
## [1] 1 2 3 1 2 3 1 2 3 1 2 3
as.numeric(as.character(my.factor2))
## [1] 3 4 5 3 4 5 3 4 5 3 4 5
class(my.factor2)
## [1] "factor"
mode(my.factor2)
## [1] "numeric"
U Create a factor with levels labeled with words. Create another factor with
the levels labeled with the same words, but ordered differently. After this con-
vert both factors to numeric vectors using as.numeric(). Explain why the two
numeric vectors differ or not from each other.
my.factor1 <- factor(rep(c("A", "F", "B", "Z"), each = 3),
levels = c("A", "F", "B", "Z")) # definition assumed
levels(my.factor1) <- list("a" = "A", "d" = "Z", "c" = "B", "b" = "F")
my.factor1
## [1] a a a b b b c c c d d d
## Levels: a d c b
Renaming only some of the levels is also possible.
my.factor1 <- factor(rep(c("A", "F", "B", "Z"), each = 3),
levels = c("A", "F", "B", "Z"))
levels(my.factor1)[c(1, 4)] <- c("a", "d") # statement assumed; it matches the printout
my.factor1
## [1] a a a F F F B B B d d d
## Levels: a F B d
Merging factor levels. We use factor() as shown below, setting the same
label for the levels we want to merge.
my.factor1 <- factor(rep(c("A", "F", "B", "Z"), each = 3)) # recreate with original levels (assumed)
factor(my.factor1,
levels = c("A", "B", "F", "Z"),
labels = c("A", "B", "C", "C"))
## [1] A A A C C C B B B C C C
## Levels: A B C
U Reordering the levels of a factor based on summaries of the values in a
numeric variable, grouped by factor level, is very useful, especially when
plotting. Function reorder() can be used in this case. It defaults to using
mean() for the summaries, but other suitable functions can be supplied in its
place.
my.vector3 <- c(5.6, 7.3, 3.1, 8.7, 6.9, 2.4, 4.5, 2.1, 1.4, 2.0)
my.factor3 <- factor(rep(c("A", "B"), 5)) # grouping factor (definition assumed)
my.factor3ord <- reorder(my.factor3, my.vector3)
my.factor3rev <- reorder(my.factor3, -my.vector3)
levels(my.factor3)
levels(my.factor3ord)
levels(my.factor3rev)
In the last statement, using the unary negation operator, which is vector-
ized, allows us to easily reverse the ordering of the levels, while still using the
default function, mean(), to summarize the data.
my.factor4 <- factor(c("c", "a", "b")) # example values (definition assumed)
my.factor4
as.integer(my.factor4)
my.factor5 <- factor(c("c", "a", "b"), levels = c("c", "b", "a")) # definition assumed
my.factor5
as.integer(my.factor5)
levels(my.factor5)[as.integer(my.factor5)]
We see above that the integer values by which the levels of a factor are stored
are equivalent to indexes or "subscripts" referencing the vector of level labels.
Function sort() operates on the values' underlying integers and sorts according
to the order of the levels, while order() operates on the values' labels and
returns a vector of indexes that arrange the values alphabetically.
sort(my.factor4)
my.factor4[order(my.factor4)]
my.factor4[order(as.integer(my.factor4))]
Run the examples in the chunk above and work out why the results differ.
2.13 Lists
The main difference of lists to other collections is, in R, that they can be
heterogeneous, i.e., contain members of different classes. The members of a list
can be considered to follow a sequence, accessible through numerical indexes,
the same as with vectors. However, frequently the members of a list are given
names, and are retrieved (indexed) through these names. Lists are created
using function list().
a.list <- list(x = 1:6, y = "a", z = c(TRUE, FALSE))
a.list
## $x
## [1] 1 2 3 4 5 6
##
## $y
## [1] "a"
##
## $z
## [1] TRUE FALSE
a.list[["x"]]
## [1] 1 2 3 4 5 6
a.list[[1]]
## [1] 1 2 3 4 5 6
a.list["x"]
## $x
## [1] 1 2 3 4 5 6
a.list[1]
## $x
## [1] 1 2 3 4 5 6
a.list[c(1,3)]
## $x
## [1] 1 2 3 4 5 6
##
## $z
## [1] TRUE FALSE
try(a.list[[c(1,3)]])
## [1] 3
str(a.list)
## List of 3
## $ x: int [1:6] 1 2 3 4 5 6
## $ y: chr "a"
## $ z: logi [1:2] TRUE FALSE
nested.list <- list(A = list(1, "aa", TRUE), B = list(2, "bb")) # definition assumed
str(nested.list)
## List of 2
## $ A:List of 3
## ..$ : num 1
## ..$ : chr "aa"
## ..$ : logi TRUE
## $ B:List of 2
## ..$ : num 2
## ..$ : chr "bb"
When dealing with deeply nested lists, it is sometimes useful to limit the number
of levels of nesting returned by str() by means of a numeric argument passed
to parameter max.level.
str(nested.list, max.level = 1)
## List of 2
## $ A:List of 3
## $ B:List of 2
U What do you expect each of the statements below to return? Before running
the code, predict what value and of which mode each statement will return. You
may use implicit or explicit calls to print(), or calls to str() to visualize the
structure of the different objects.
str(nested.list)
nested.list[2:1]
nested.list[1]
nested.list[[1]][2]
nested.list[[1]][[2]]
nested.list[2]
nested.list[2][[1]]
is.list(nested.list)
## [1] TRUE
c.vec <- unlist(nested.list) # statement assumed; it flattens the list into a vector
is.list(c.vec)
## [1] FALSE
mode(nested.list)
## [1] "list"
mode(c.vec)
## [1] "character"
names(nested.list)
## [1] "A" "B"
names(c.vec)
## [1] "A1" "A2" "A3" "B1" "B2"
The returned value is a vector with named member elements. We use function
str() to figure out how this vector relates to the original list. The names are based
on the names of list elements when available, while numbers are used for anony-
mous nodes. We can access the members of the vector either through numeric
indexes or names.
str(c.vec)
## Named chr [1:5] "1" "aa" "TRUE" "2" "bb"
## - attr(*, "names")= chr [1:5] "A1" "A2" "A3" "B1" "B2"
c.vec[2]
## A2
## "aa"
c.vec["A2"]
## A2
## "aa"
a.df <- data.frame(x = 1:6, y = "a", z = c(TRUE, FALSE))
a.df
## x y z
## 1 1 a TRUE
## 2 2 a FALSE
## 3 3 a TRUE
## 4 4 a FALSE
## 5 5 a TRUE
## 6 6 a FALSE
str(a.df)
## 'data.frame': 6 obs. of 3 variables:
## $ x: int 1 2 3 4 5 6
## $ y: chr "a" "a" "a" "a" ...
## $ z: logi TRUE FALSE TRUE FALSE TRUE FALSE
class(a.df)
## [1] "data.frame"
mode(a.df)
## [1] "list"
is.data.frame(a.df)
## [1] TRUE
is.list(a.df)
## [1] TRUE
Indexing of data frames is similar to that of the underlying list, but not exactly
equivalent. We can index with operator [[]] to extract individual variables, thought
of as the columns in a matrix-like list or "worksheet."
a.df$x
## [1] 1 2 3 4 5 6
a.df[["x"]]
## [1] 1 2 3 4 5 6
a.df[[1]]
## [1] 1 2 3 4 5 6
class(a.df)
## [1] "data.frame"
With function class() we can query the class of an R object (see section 2.8
on page 41). As we saw in the two previous chunks, list and data.frame objects
belong to two different classes. However, their relationship is based on a hierarchy
of classes. We say that class data.frame is derived from class list. Consequently,
data frames inherit the methods and characteristics of lists, as long as they have
not been hidden by new ones defined for data frames.
In the same way as with vectors, we can add members to lists and data frames.
a.df$x2 <- 6:1 # statements assumed; they match the outputs below
a.df$x3 <- "b"
str(a.df)
## 'data.frame': 6 obs. of 5 variables:
## $ x : int 1 2 3 4 5 6
## $ y : chr "a" "a" "a" "a" ...
## $ z : logi TRUE FALSE TRUE FALSE TRUE FALSE
## $ x2: int 6 5 4 3 2 1
## $ x3: chr "b" "b" "b" "b" ...
We have added two columns to the data frame, and in the case of column x3
recycling took place. This is where lists and data frames differ substantially in their
behavior. In a data frame, although class and mode can be different for different
variables (columns), they are required to be vectors or factors of the same length.
In the case of lists, there is no such requirement, and recycling never takes place
when adding a node. Compare the values returned below for a.ls, to those in the
example above for a.df.
a.ls <- list(x = 1:6, y = "a", z = c(TRUE, FALSE)) # definition assumed
str(a.ls)
## List of 3
## $ x: int [1:6] 1 2 3 4 5 6
## $ y: chr "a"
## $ z: logi [1:2] TRUE FALSE
a.ls$x2 <- 6:1
a.ls$x3 <- "b"
str(a.ls)
## List of 5
## $ x : int [1:6] 1 2 3 4 5 6
## $ y : chr "a"
## $ z : logi [1:2] TRUE FALSE
## $ x2: int [1:6] 6 5 4 3 2 1
## $ x3: chr "b"
Data frames are extremely important to anyone analyzing or plotting data using
R. One can think of data frames as tightly structured work-sheets, or as lists. As
you may have guessed from the examples earlier in this section, there are several
different ways of accessing columns, rows, and individual observations stored in a
data frame. The columns can be treated as members in a list, and can be accessed
both by name or index (position). When accessed by name, using $ or double square
brackets, a single column is returned as a vector or factor. In contrast to lists,
data frames are always “rectangular” and for this reason the values stored can
also be accessed in a way similar to how elements in a matrix are accessed, using
two indexes. As we saw for vectors, indexes can be vectors of integer numbers or
vectors of logical values. For columns they can, in addition, be vectors of character
strings matching the names of the columns. When using indexes it is extremely
important to remember that the indexes are always given row first.
Indexing of data frames can in all cases be done as if they were lists, which
is preferable, as it ensures compatibility with regular R lists and with newer
implementations of data-frame-like structures like those defined in package
‘tibble’. Using this approach, extracting two values from the second and third
positions in the first column of a.df is done as follows, using numerical in-
dexes.
a.df[[1]][2:3]
## [1] 2 3
a.df[["x"]][2:3]
## [1] 2 3
The less portable, matrix-like indexing is done as follows, with the first in-
dex indicating rows and the second one indicating columns. This notation al-
lows simultaneous extraction from multiple columns, which is not possible
with lists. The value returned is a “smaller” data frame.
a.df[2:3, 1:2]
## x y
## 2 2 a
## 3 3 a
If the length of the column indexing vector is one, the returned value is a
vector, which is not consistent with the previous example which returned a
data frame. This is not only surprising in everyday use, but can be the source
of bugs when coding algorithms in which the length of the second index vector
cannot be guaranteed to be always more than one.
a.df[2:3, 1]
## [1] 2 3
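As an aside (a sketch), the extraction operator accepts a drop parameter; passing drop = FALSE preserves the data frame structure even when a single column is selected.
a.df[2:3, 1, drop = FALSE]
## x
## 2 2
## 3 3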
# first row
a.df[1, ]
## x y z x2 x3
## 1 1 a TRUE 6 b
# the rows for which x > 3 keeping all columns except the third one
a.df[a.df$x > 3, -3]
## x y x2 x3
## 4 4 a 3 b
## 5 5 a 2 b
## 6 6 a 1 b
As explained earlier for vectors (see section 2.10 on page 45), indexing can be
present both on the right-hand side and left-hand side of an assignment. The next
few examples do assignments to “cells” of a.df, either to one whole column, or
individual values. The last statement in the chunk below copies a number from
one location to another by using indexing of the same data frame both on the
right side and left side of the assignment.
a.df[1, 1] <- 99
a.df
## x y z x2 x3
## 1 99 a TRUE 6 b
## 2 2 a FALSE 5 b
## 3 3 a TRUE 4 b
## 4 4 a FALSE 3 b
## 5 5 a TRUE 2 b
## 6 6 a FALSE 1 b
We mentioned above that indexing by name can be done either with dou-
ble square brackets, [[]], or with $. In the first case the name of the variable
or column is given as a character string, enclosed in quotation marks, or as a
variable with mode character. When using $, the name is entered as a constant,
without quotation marks, and cannot be a variable.
x.list[["abcd"]]
## [1] 123
x.list[[a.var]]
## [1] 123
x.list$abcd
## [1] 123
x.list$ab
## [1] 123
x.list$a
## [1] 123
Both in the case of lists and data frames, when using double square brack-
ets, an exact match is required between the name in the object and the name
used for indexing. In contrast, with $, any unambiguous partial match will be
accepted. For interactive use, partial matching is helpful in reducing typing.
However, in scripts, and especially R code in packages, it is best to avoid the
use of $ as partial matching to a wrong variable present at a later time, e.g.,
when someone else revises the script, can lead to very difficult-to-diagnose er-
rors. In addition, as $ is implemented by first attempting a match to the name
and then calling [[]], using $ for indexing can result in slightly slower perfor-
mance compared to using [[]]. It is possible to set an R option so that partial
matching triggers a warning, which can be very useful when debugging.
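A minimal sketch of using this option (assuming x.list as defined above):
options(warnPartialMatchDollar = TRUE)
x.list$ab
## Warning in x.list$ab: partial match of 'ab' to 'abcd'
## [1] 123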
U What is the behavior of subset() when the condition is NA? Find the answer
by writing code to test this, for a case where tests for different rows return NA,
TRUE and FALSE.
When calling functions that return a vector, data frame, or other structure, the
square brackets can be appended to the rightmost parenthesis of the function call,
in the same way as to the name of a variable holding the same data.
a.df <- data.frame(x = 1:6, y = "a", z = c(TRUE, FALSE)) # recreate a.df (statement assumed)
subset(a.df, x > 3)[ , -3]
## x y
## 4 4 a
## 5 5 a
## 6 6 a
None of the examples in the last three code chunks alter the original data frame
a.df. We can store the returned value using a new name if we want to preserve
a.df unchanged, or we can assign the result to a.df, deleting in the process the
previously stored value.
In the example below, the names in the expression passed as the second
argument to subset() are first searched within a.df and, if not found, searched
for in the environment. There being no variable A in the data frame a.df, vector A
from the environment is silently used in the expression, resulting in a returned
data frame with no rows.
A <- 1
subset(a.df, A > 3)
## [1] x y z
## <0 rows> (or 0-length row.names)
The use of subset() is convenient, but more prone to result in bugs com-
pared to directly using the extraction operator []. This same “cost” to achiev-
ing convenience applies to functions like attach() and with() described below.
The longer a script is expected to be used, adapted and reused, the
more careful we should be when using any of these functions. An alternative
way of avoiding excessive verbosity is to keep the names of data frames short.
Instead of using the equality test, we can use the operator %in% or function
grepl() to delete multiple columns in a single statement.
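A hedged sketch of this approach (the column names are examples):
a.df[ , !(names(a.df) %in% c("y", "z")), drop = FALSE] # keep all columns except y and z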
U In the previous code chunk we deleted the last column of the data frame
a.df. Here is an esoteric trick for you to first untangle how it changes the
positions of columns and rows, and then to think how and why it can
be useful to use indexing with the extraction operator [ ] on both sides of the
assignment operator <-.
Repeatedly spelling out the name of a data frame when coding even a simple
calculation can easily result in a long and difficult-to-read statement. (Method
head() is used here to limit the displayed value to the first two rows; head() is
described in section 2.17 on page 81.)
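A sketch of such a verbose statement follows; the definition of my_data_frame.df is an assumption, chosen to be consistent with the outputs shown below.
my_data_frame.df <- data.frame(A = 1:10, B = 3)
my_data_frame.df$C <- (my_data_frame.df$A + my_data_frame.df$B) / my_data_frame.df$A
head(my_data_frame.df, 2)
## A B C
## 1 1 3 4.0
## 2 2 3 2.5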
Using attach() we can alter how R looks up names and consequently sim-
plify the statement. With detach() we can restore the original state. It is im-
portant to remember that we can only simplify the right-hand side of the
assignment this way, while the "destination" of the result of the computation
still needs to be fully specified on the left-hand side of the assignment operator.
We include below only one statement between attach() and detach(), but multiple
statements are allowed. Furthermore, if variables with the same name as the
columns exist in the search path, these will take precedence, something that
can result in bugs or crashes, or, as seen below, in a message warning that variable
A from the global environment will be used instead of column A of the attached
my_data_frame.df. The returned value is, of course, not the desired one.
attach(my_data_frame.df)
## The following object is masked _by_ .GlobalEnv:
##
## A
my_data_frame.df$C <- (A + B) / A
detach(my_data_frame.df)
head(my_data_frame.df, 2)
## A B C
## 1 1 3 4
## 2 2 3 4
In the case of with() only one, possibly compound code statement is af-
fected and this statement is passed as an argument. As before, we need to
fully specify the left-hand side of the assignment. The value returned is the
one returned by the statement passed as an argument, in the case of compound
statements, the value returned by the last contained simple code statement to
be executed. Consequently, if the intent is to modify the container, assignment
to an individual member variable (column in this case) is required. In contrast
to the behavior of attach(), In this case, column A of my_data_frame.df takes
precedence, and the returned value is the expected one.
my_data_frame.df$C <- with(my_data_frame.df, (A + B) / A) # statement assumed
head(my_data_frame.df, 2)
## A B C
## 1 1 3 4.0
## 2 2 3 2.5
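Function within() behaves similarly to with(), but returns a copy of the data frame with the assignments made in the code applied to it, so the result must be assigned back to the container to modify it.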
my_data_frame.df <- within(my_data_frame.df,
{C <- (A + B) / A
D <- A * B
E <- A / B + 1}
)
head(my_data_frame.df, 2)
## A B E D C
## 1 1 3 1.333333 3 4.0
## 2 2 3 1.666667 6 2.5
To move columns within a data frame, knowing the original position and target
position, we can use numerical indexes on both the right-hand side and left-hand
side of an assignment. To retain the correct naming after the column swap, we
need to separately swap the names of the columns.
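A minimal sketch, assuming we want to swap the first two columns:
my_data_frame.df[ , 1:2] <- my_data_frame.df[ , 2:1]
names(my_data_frame.df)[1:2] <- names(my_data_frame.df)[2:1]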
Taking into account that order() returns the indexes needed to sort a vector
(see page 49), we can use order() to generate the indexes needed to sort either
columns or rows of a data frame. When we want to sort rows, the argument to
order() is usually a column of the data frame being arranged. However, any vec-
tor of suitable length, including the result of applying a function to one or more
columns, can be passed as an argument to order().
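A minimal sketch, sorting the rows of my_data_frame.df by decreasing values in one column:
my_data_frame.df[order(my_data_frame.df$A, decreasing = TRUE), ]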
U The first task to be completed is to sort a data frame based on the values
in one column, using indexing and order(). Create a new data frame with
three numeric columns containing three different haphazard sequences of values.
Call these columns A, B and C. 1) Sort the rows of the data frame so that the
values in A are in decreasing order. 2) Sort the rows of the data frame according
to increasing values of the sum of A and B, without adding a new column to the
data frame or storing the vector of sums in a variable. In other words, do the
sorting based on sums calculated on the fly.
U Repeat the tasks in the playground immediately above but using fac-
tors instead of numeric vectors as columns in the data frame. Hint: revisit the
exercise on page 61 where the use of order() on factors is described.
comment(a.df)
## NULL
comment(a.df) <- "this data frame is used as an example" # text assumed
comment(a.df)
## [1] "this data frame is used as an example"
Methods like names(), dim() or levels() return values retrieved from attributes
stored in R objects, and methods like names()<-, dim()<- or levels()<- set (or un-
set with NULL) the value of the respective attributes. Specific query and set methods
do not exist for all attributes. Methods attr(), attr()<- and attributes() can be
used with any attribute. In addition, method str() displays all components of R
objects including their attributes.
names(a.df)
## [1] "x" "y" "z"
attr(a.df, "my.attribute") <- "a bit of metadata" # statement assumed
attributes(a.df)
## $names
## [1] "x" "y" "z"
##
## $class
## [1] "data.frame"
##
## $row.names
## [1] 1 2 3 4 5 6
##
## $comment
## [1] "this data frame is used as an example"
##
## $my.attribute
## [1] "a bit of metadata"
numbers <- 1:10 # definition assumed
attributes(numbers)
## NULL
a.factor <- factor(numbers) # definition assumed from the levels shown below
attributes(a.factor)
## $levels
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
##
## $class
## [1] "factor"
data(cars)
Once we have a data set available, the first step is usually to explore it, and we
will do this with cars in section 2.17 on page 81.
We first create a small data frame and save it to a file (the definition and save
statement are assumed; they are consistent with the examples that follow).
my.df <- data.frame(x = 1:5, y = 5:1)
save(my.df, file = "my-df.rda")
We delete the data frame object and confirm that it is no longer present in the
workspace.
rm(my.df)
ls(pattern = "my.df")
## character(0)
With load() we restore the saved object and print it.
load("my-df.rda")
my.df
## x y
## 1 1 5
## 2 2 4
## 3 3 3
## 4 4 2
## 5 5 1
The default format used is binary and compressed, which results in smaller
files.
U In the example above, only one object was saved, but one can simply give
the names of additional objects as arguments. Just try saving more than one
data frame to the same file. Then the data frames plus a few vectors. After
creating each file, clear the workspace and then restore from the file the objects
you saved.
Calls to functions like ls() and save() can be combined into a single statement
by nesting the function calls.
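A sketch of such a nested call (assuming the objects to be saved have names matching the pattern):
save(list = ls(pattern = "my.df"), file = "my-df1.rda")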
U Practice using different patterns with ls(). You do not need to save the
objects to a file. Just have a look at the list of object names returned.
As a coda, we show how to clean up by deleting the two files we created. Func-
tion unlink() can be used to delete any files for which the user has enough rights.
unlink(c("my-df.rda", "my-df1.rda"))
saveRDS(my.df, "my-df.rds")
If we read the file, by default the read R object will be printed at the console.
readRDS("my-df.rds")
## x y
## 1 1 5
## 2 2 4
## 3 3 3
## 4 4 2
## 5 5 1
In the next example we assign the read object to a different name, and check
that the object read is identical to the one saved.
my.df1 <- readRDS("my-df.rds")
identical(my.df, my.df1)
## [1] TRUE
unlink("my-df.rds")
class(cars)
## [1] "data.frame"
nrow(cars)
## [1] 50
ncol(cars)
## [1] 2
names(cars)
## [1] "speed" "dist"
head(cars)
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
## 6 9 10
tail(cars)
## speed dist
## 45 23 54
## 46 24 70
## 47 24 92
## 48 24 93
## 49 24 120
## 50 25 85
str(cars)
## 'data.frame': 50 obs. of 2 variables:
## $ speed: num 4 4 7 7 8 9 10 10 10 11 ...
## $ dist : num 2 10 4 22 16 10 18 26 34 17 ...
U Look up the help pages for head() and tail(), and edit the code above to
print only the first two lines, or only the last three lines of cars, respectively.
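The statement referred to in the next paragraph can be sketched with sapply(), applying mode() to each column (the exact call is an assumption):
sapply(X = cars, FUN = mode)
## speed dist
## "numeric" "numeric"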
The statement above returns a vector of character strings, with the mode of
each column. Each element of the vector is named according to the name of the
corresponding "column" in the data frame. For this same statement to be used with
any other data frame or list, we need only substitute the name of the object,
passed as the argument to the first parameter, X, by that of the object of current
interest.
summary(cars)
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
sapply(cars, range)
## speed dist
## [1,] 4 2
## [2,] 25 120
2.18 Plotting
The base-R generic method plot() can be used to plot different data. Being a
generic method, it has specializations suitable for different kinds of objects (see sec-
tion 5.4 on page 172 for a brief introduction to objects, classes and methods). In
this section we only very briefly demonstrate the use of the most common base-R
graphics functions. They are well described in the book R Graphics (Murrell 2019).
We will not describe the Lattice (based on S's Trellis) approach to plotting (Sarkar
2008). Instead we describe in detail the use of the grammar of graphics and plot-
ting with package ‘ggplot2’ in chapter 7 starting on page 203.
It is possible to pass two variables (here columns from a data frame) directly as
arguments to the x and y parameters of plot().
plot(x = cars$speed, y = cars$dist)
[Figure: scatterplot of cars$dist (y axis) against cars$speed (x axis)]
It is also possible, and usually more convenient, to use a formula to specify the
variables to be plotted on the 𝑥 and 𝑦 axes, passing additionally as an argument to
parameter data the name of the data frame containing these variables. The formula
dist ~ speed, is read as dist explained by speed—i.e., dist is mapped to the 𝑦-axis
as the dependent variable and speed to the 𝑥-axis as the independent variable.
plot(dist ~ speed, data = cars)
[Figure: scatterplot of dist (y axis) against speed (x axis), equivalent to the previous one]
[Figure: box plots of weight (y axis) for each level of the factor feed (x axis); with a factor as explanatory variable, plot() produces box plots]
Method plot() and its variants, when used for plotting, send their graphical
output to a graphical output device. In R, graphical devices are very fre-
quently not real physical devices like a printer, but virtual devices implemented
fully in software, which translate the plotting commands into a specific graphical
file format. Several different graphical devices are available in R, and they differ in
the kind of output they produce: raster files (e.g., TIFF, PNG and JPEG formats),
vector-graphics files (e.g., SVG, EPS and PDF) or output to a physical device like a
window on the screen of a computer. Additional devices are available through
contributed R packages.
Devices follow the paradigm of ON and OFF switches. Some devices producing
a file as output, save this output only when the device is closed. When opening a
device the user supplies additional information. For the PDF device that produces
output in a vector-graphics format, width and height of the output are specified in
inches. A default file name is used unless we pass a character string as an argument
to parameter file.
cairo_pdf(file = "my-plot.pdf", width = 6, height = 5) # opening statement assumed; name and size are examples
plot(dist ~ speed, data = cars)
dev.off()
## cairo_pdf
## 2
Raster devices return bitmaps and width and height are specified in pixels.
png(file = "my-plot.png", width = 600, height = 500) # opening statement assumed; name and size are examples
plot(dist ~ speed, data = cars)
dev.off()
## cairo_pdf
## 2
Jim Lemon
Kickstarting R
A script is a text file containing a sequence of R code statements. In the rest of
the present section I discuss how to write readable and reliable scripts and how
to use them.
• The file contains valid R statements (including comments) and nothing else.
• The R statements are in the file in the order that they must be executed.
For example, assume that the file my.first.script.r contains this single statement.
print(3 + 4)
We can run the script with source().
source("my.first.script.r")
## [1] 7
The results of executing the statements contained in the file will appear in the
console. The commands themselves are not shown (by default the sourced file is
not echoed to the console) and the results will not be printed unless you include
explicit print() commands in the script. This applies in many cases also to plots—
e.g., a figure created with ggplot() needs to be printed if we want it to be included
in the output when the script is run. Adding a redundant print() is harmless.
From within RStudio, if you have an R script open in the editor, there will be a
“source” icon visible with an attached drop-down menu from which you can choose
“Source” as described above, or “Source with echo,” or “Source as local job” for the
script in the currently active editor tab.
When a script is sourced, the output can be saved to a text file instead of being
shown in the console. It is also easy to call R with the R script file as an argument
directly at the operating system shell or command-interpreter prompt, and obvi-
ously also from shell scripts. The next two chunks show commands entered at the
OS shell command prompt rather than at the R command prompt.
You can open an operating system’s shell from the Tools menu in RStudio, to
run this command. The output will be printed to the shell console. If you would like
to save the output to a file, use redirection using the operating system’s syntax.
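A sketch of the two shell commands (assuming Rscript is on the search path; the output file name is an example):
Rscript my.first.script.r
Rscript my.first.script.r > my.first.script.out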
Sourcing is very useful when the script is ready; however, while developing a
script, or sometimes when testing things, one usually wants to run (or execute)
one or a few statements at a time. This can be done using the "run" button after
either positioning the cursor in the line to be executed, or selecting the text that
one would like to run (the selected text can be part of a line, a whole line, or a
group of lines, as long as it is syntactically valid). The key shortcut Ctrl-Enter is
equivalent to pressing the "run" button in RStudio.
If one is very familiar with similar problems, one would just create a new text
file, write the whole thing in the editor, and then test it. This is rather unusual.
If one is moderately familiar with the problem, one would write the script as
above, but test it, step by step, as one is writing it. This is usually what I
do.
If one is mostly playing around, then, if one is using RStudio, one can type state-
ments at the console prompt. As you should know by now, everything you run
at the console is saved to the "History." In RStudio, the History is displayed in
its own pane, and in this pane one can select any previous statement(s) and, by
clicking on a single icon, copy and paste them to either the R console prompt or
the cursor position in the editor pane. In this way one can build a script by copy-
ing and pasting from the history to your script file the bits that have worked as
you wanted.
(If you use a different IDE or editor with an R mode, the details will vary, but a run command will usually be available.)
U By now you should be familiar enough with R to be able to write your own
script.
1. Create a new R script (in RStudio, from the File menu, “+” icon, or by
typing “Ctrl + Shift + N”).
2. Save the file as my.second.script.r.
3. Use the editor pane in RStudio to type some R commands and com-
ments.
4. Run individual commands.
5. Source the whole file.
a <- 2 # height
b <- 4 # length
C <-
a *
b
C -> variable
print(
"area: ", variable
)
The points discussed above already help a lot. However, one can go further in
achieving the goal of human readability by interspersing explanations and code
“chunks” and using all the facilities of typesetting, even of formatted maths for-
mulas and equations, within the listing of the script. Furthermore, by including
the results of the calculations and the code itself in a typeset report built auto-
matically, we can ensure that the results are indeed the result of running the code
shown. This greatly contributes to data analysis reproducibility, which is becoming
a widespread requirement for any data analysis both in academia and in industry.
It is possible not only to typeset whole books like this one, but also whole data-
based web sites with these tools.
In the realm of programming, this approach is called literate programming and
was first proposed by Donald Knuth (Knuth 1984) through his WEB system. In the
case of R programming, the first support of literate programming was through
‘Sweave’, which has been mostly superseded by ‘knitr’ (Xie 2013). This package
supports the use of Markdown or LATEX (Lamport 1994) as the markup language
for the textual contents and also formats and adds syntax highlighting to code
chunks. Rmarkdown is an extension to Markdown that makes it easier to include
R code in documents (see http://rmarkdown.rstudio.com/). It is the basis of
R packages that support typesetting large and complex documents (‘bookdown’),
web sites (‘blogdown’), package vignettes (‘pkgdown’) and slides for presentations
(Xie 2016; Xie et al. 2018). The use of ‘knitr’ is very well integrated into the RStudio
IDE.
This is not strictly an R programming subject, as it concerns programming in
any language. On the other hand, it is an incredibly important skill to learn, and it
is well described in other books and web sites cited in the previous paragraph. This
whole book, including figures, has been generated using ‘knitr’ and the source code
for the book is available through Bitbucket at https://bitbucket.org/aphalo/
learnr-book.
It has been just so in all of my inventions. The first step is an intuition, and
comes with a burst, then difficulties arise–this thing gives out and [it is] then
that “Bugs”–as such little faults and difficulties are called–show themselves and
months of intense watching, study and labor are requisite before commercial
success or failure is certainly reached.
The quoted paragraph above makes clear that only very exceptionally does any
new design fully succeed. The same applies to R scripts as well as any other non-
trivial piece of computer code. From this it logically follows that testing and de-
bugging are fundamental steps in the development of R scripts and packages.
Debugging, as an activity, is outside the scope of this book. However, clear pro-
gramming style and good documentation are indispensable for efficient testing
and reuse.
Even for scripts used for analyzing a single data set, we need to be confident
that the algorithms and their implementation are valid, and able to return correct
results. This is true both for scientific reports, expert data-based reports and any
data analysis related to assessment of compliance with legislation or regulations.
Of course, even in cases when we are not required to demonstrate validity, say
for decision making purely internal to a private organization, we will still want to
avoid costly mistakes.
The first step in producing reliable computer code is to accept that any code
that we write needs to be tested and, if possible, validated. Another important
step is to make sure that input is validated within the script and a suitable error
produced for bad input (including valid input values falling outside the range that
can be reliably handled by the script).
If during testing, or during normal use, a wrong value is returned by a cal-
culation, or no value (e.g., the script crashes or triggers a fatal error), debugging
consists in finding the cause of the problem. The cause can be a mistake in
the implementation of an algorithm, or a flaw in the algorithm itself. However,
many apparent bugs are caused by bad or missing handling of special cases like
invalid input values, rounding errors, division by zero, etc., in which a program
crashes instead of elegantly issuing a helpful error message.
Diagnosing the source of bugs is, in most cases, like detective work. One uses
hunches based on common sense and experience to try to locate the lines of code
causing the problem. One follows different leads until the case is solved. In most
cases, at the very bottom we rely on some sort of divide-and-conquer strategy. For
example, we may check the value returned by intermediate calculations until we
locate the earliest code statement producing a wrong value. Another common case
is when some input values trigger a bug. In such cases it is frequently best to start
by testing if different “cases” of input lead to errors/crashes or not. Boundary input
values are usually the telltale ones: e.g., for numbers, zero, negative and positive
values, very large values, very small values, missing values (NA), vectors of length
zero (numeric()), etc.
Error messages When debugging, keep in mind that in some cases a single
bug can lead to a whole cascade of error messages. Do also keep in mind that
typing mistakes, originating when code is entered through the keyboard, can
wreak havoc in a script: usually there is little correspondence between the
When several errors are triggered, start by reading the error message printed
first, as later errors can be an indirect consequence of earlier ones.
There are special tools, called debuggers, available, and they help enormously.
Debuggers allow one to step through the code, executing one statement at a time,
and at each pause, allowing the user to inspect the objects present in the R en-
vironment and their values. It is even possible to execute additional statements,
say, to modify the value of a variable, while execution is paused. An R debugger is
available within RStudio and also through the R console.
When writing your first scripts, you will manage perfectly well, and learn more
by running the script one line at a time and when needed temporarily inserting
print() statements to “look” at how the value of variables changes at each step. A
debugger allows a lot more control, as one can “step in” and “step out” of function
definitions, and set and unset break points where execution will stop, which is
especially useful when developing R packages.
When reproducing the examples in this chapter, do keep this section in mind.
In addition, if you get stuck trying to find the cause of a bug, do extend your search
both to the most trivial of possible causes, and to the least likely ones (such as a
bug in a package installed from CRAN or R itself). Of course, when suspecting a bug
in code you have not written, it is wise to very carefully read the documentation,
as the “bug” may be just in your understanding of what a certain piece of code
is expected to do. Also keep in mind that as discussed on page 12, you will be
able to find online already-answered questions to many of your likely problems
and doubts. For example, Googling for the text of an error message is usually well
rewarded.
When installing packages from other sources than CRAN (e.g., develop-
ment versions from GitHub, Bitbucket or R-Forge, or in-house packages) there
is no warranty that conflicts will not happen. Packages (and their versions) re-
leased through CRAN are regularly checked for inter-compatibility, while pack-
ages released through other channels are usually checked against only a few
packages.
Conflicts among packages can easily arise, for example, when they use the
same names for objects or functions. In addition, many packages use functions
defined in packages in the R distribution itself or other independently devel-
oped packages by importing them. Updates to depended-upon packages can
“break” (make non-functional) the dependent packages or parts of them. The
rigorous testing by CRAN detects such problems in most cases when package
revisions are submitted, forcing package maintainers to fix the problems before
the updated packages are distributed through CRAN.
print("A")
## [1] "A"
{
print("B")
print("C")
}
## [1] "B"
## [1] "C"
Conditional execution allows enabling and disabling (i.e., switching ON and OFF)
parts of a script based on the result returned by a logical expression. This
expression can also be a flag, i.e., a logical variable set manually, preferably
near the top of the script. Use of flags is most useful when switching between
two script behaviors depends on multiple sections of code. A frequent use case
for flags is jointly enabling and disabling printing of output from multiple
statements scattered over a long script.
R has two types of if statements, non-vectorized and vectorized. We will start
with the non-vectorized one, which is similar to what is available in most other
computer programming languages. We start with toy examples demonstrating how
if and if-else statements work. Later we will see examples closer to real use cases.
if (flag) print("Hello!")
## [1] "Hello!"
U Play with the code above by changing the value assigned to variable flag
to FALSE, NA, and logical(0).
In the example above we use variable flag as the condition.
Nothing in the R language prevents this condition from being a logical
constant. Explain why if (TRUE) in the syntactically-correct statement below
is of no practical use.
if (TRUE) print("Hello!")
## [1] "Hello!"
Conditional execution is much more useful than what could be expected from
the previous example, because the statement whose execution is being controlled
can be a compound statement of almost any length or complexity. A very simple
example follows.
a <- 10.0
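# a sketch of a compound statement under conditional execution (example assumed)
if (a > 0.0) {
  cat("'a' is positive\n")
  b <- sqrt(a)
  print(b)
}
## 'a' is positive
## [1] 3.162278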
U Play with the code in the chunk above by assigning different numeric vec-
tors to a.
Why does the statement below (not evaluated here) trigger an error while
the one above does not?
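A sketch of the kind of statement meant (an assumption): at the top level of a script, R parses the first line as a complete statement, so the else on the next line triggers an error.
if (a < 0.0) print("'a' is negative")
else print("'a' is not negative")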
How do the continuation line rules apply when we add curly braces as
shown below.
# 1
a <- 1
if (a < 0.0) {
print("'a' is negative")
} else {
print("'a' is not negative")
}
## [1] "'a' is not negative"
U Play with the use of conditional execution, with both simple and compound
statements, and also think how to combine if and else to select among more
than two options.
U Study the conversion rules between numeric and logical values, run each
of the statements below, and explain the output based on how type conversions
are interpreted, remembering the difference between floating-point numbers as
implemented in computers and real numbers (ℝ) as defined in mathematics.
if (0) print("hello")
if (-1) print("hello")
if (0.01) print("hello")
if (1e-300) print("hello")
if (1e-323) print("hello")
if (1e-324) print("hello")
if (1e-500) print("hello")
if (as.logical("true")) print("hello")
if (as.logical(as.numeric("1"))) print("hello")
if (as.logical("1")) print("hello")
if ("1") print("hello")
Hint: if you need to refresh your understanding of the type conversion rules,
see section 2.9 on page 42.
U Do play with the use of the switch statement. Look at the documentation
for switch() using help(switch) and study the examples at the end of the help
page.
The switch() statement can substitute for chained if else statements when
all the conditions are comparisons against different constant values, resulting in
more concise and clear code.
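A minimal sketch (the values are examples):
my.color <- "red"
switch(my.color,
red = "warm",
blue = "cool",
"not known")
## [1] "warm"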
U Some additional examples to play with, with a few surprises. Study the
examples below until you understand why returned values are what they are.
In addition, create your own examples to test other possible cases. In other
words, play with the code until you fully understand how ifelse works.
a <- 1:10
ifelse(a > 5, 1, -1)
ifelse(a > 5, a + 1, a - 1)
ifelse(any(a > 5), a + 1, a - 1) # tricky
ifelse(logical(0), a + 1, a - 1) # even more tricky
ifelse(NA, a + 1, a - 1) # as expected
a <- -10:-1
b <- +1:10
c <- c(rep("a", 5), rep("b", 5))
# your code
If you do not understand how the three vectors are built, or you cannot
guess the values they contain by reading the code, print them, and play with the
arguments until you understand what each parameter does. Also use help(rep)
and/or help(ifelse) to access the documentation.
3.3.3 Iteration
We give the name iteration to the process of repetitive execution of a program state-
ment (simple or compound)—e.g., computed by iteration. We use the same word,
iteration, to name each one of these repetitions of the execution of a statement—
e.g., the second iteration.
The section of computer code being executed multiple times, forms a loop (a
closed path). Most loops contain a condition that determines when the flow of ex-
ecution will exit the loop and continue at the next statement following the loop.
In R three types of iteration loops are available: those using for, while and repeat
constructs. They differ in how much flexibility they provide with respect to the val-
ues they iterate over, and how the condition that terminates the iteration is tested.
When the same algorithm can be implemented with more than one of these con-
structs, using the least flexible of them usually results in the easiest to understand
R scripts. In R, rather frequently, explicit loops as described in this section can be
replaced advantageously by calls to the apply functions described in section 3.4
on page 108.
The most frequently used type of loop is a for loop. In R, these loops iterate
over the members of a list or vector of values to act upon.
b <- 0
for (a in 1:5) b <- b + a
b
## [1] 15
Here the statement b <- b + a is executed five times, with variable a sequen-
tially taking each of the values in 1:5. Instead of a simple statement used here, a
compound statement could also have been used for the body of the for loop.
a <- c(1, 4, 3, 6, 8)
for(x in a) {print(x*2)} # print is needed!
## [1] 2
## [1] 8
## [1] 6
## [1] 12
## [1] 16
A call to for does not return a value. We need to assign values to an object
so that they are not lost. If we print the value of this object at each iteration, we
can follow how the stored value changes. Printing allows us to see how the vector
grows in length, unless we create a long-enough vector before the start of the loop.
b <- numeric()
for(i in seq(along.with = a)) {
b[i] <- a[i]^2
print(b)
}
## [1] 1
## [1] 1 16
## [1] 1 16 9
## [1] 1 16 9 36
## [1] 1 16 9 36 64
b
## [1] 1 16 9 36 64
U Look at the results from the above examples, and try to understand where
the returned value comes from in each case. In the code chunk above, print()
is used within the loop to make intermediate values visible. You can add addi-
tional print() statements to visualize other variables, such as i, or run parts
of the code, such as seq(along.with = a), by themselves.
In this case, the code examples trigger no errors or warnings, but the same
approach can be used for debugging syntactically correct code that does not
return the expected results.
b <- numeric(length(a))
for(i in seq(along.with = a)) {
b[i] <- a[i]^2
}
print(b)
c <- numeric(length(a))
for(i in 1:length(a)) {
c[i] <- a[i]^2
}
print(c)
for loops as described above, in the absence of errors, have statically pre-
dictable behavior. The compound statement in the loop will be executed once
for each member of the vector or list. Special cases may require the alteration
of the normal flow of execution in the loop. Two cases are easy to deal with,
one is stopping iteration early, which we can do with break(), and another is
jumping ahead to the start of the next iteration, which we can do with next().
while loops are frequently useful, even if not as frequently used as for loops.
Instead of a list or vector, they take a logical argument, which is usually an expres-
sion, but which can also be a variable.
a <- 2
while (a < 50) {
print(a)
a <- a^2
}
## [1] 2
## [1] 4
## [1] 16
print(a)
## [1] 256
U Make sure that you understand why the final value of a is larger than 50.
# assumed reconstruction of the chunk referred to below
a <- 2
print(a)
while (a < 50) {
print(a <- a^2)
}
Explain why this works, and how it relates to the support in R for chained
assignments to several variables within a single statement. Explain why a second
print(a) has been added before while(). Hint: experiment if necessary.
while loops as described above will terminate when the condition tested
is FALSE. In those cases that require stopping iteration based on an additional
test condition within the compound statement, we can call break() in the body
of an if or else statement.
The repeat construct is less frequently used, but adds flexibility: termination
always depends on a call to break(), which can be located anywhere within the
compound statement that forms the body of the loop. To achieve a conditional end
of iteration, break() must be called, as otherwise iteration in a repeat loop never
stops.
a <- 2
repeat{
print(a)
if (a > 50) break()
a <- a^2
}
## [1] 2
## [1] 4
## [1] 16
## [1] 256
U Please explain why the example above returns the values it does. Use the
approach of adding print() statements, as described on page 101.
Although repeat loop constructs are easier to read if they have a single
condition resulting in termination of iteration, it is allowed by the R language
for the compound statement in the body of a loop to contain more than one
call to break(), each within a different if or else statement.
Whenever working with large data sets, or many similar data sets, we will
need to take performance into account. As vectorization usually also makes
code simpler, it is good style to use vectorization whenever possible. For op-
erations that are frequently used, R includes specific functions. It is thus im-
portant to consider not only vectorization of arithmetic but also check for the
availability of performance-optimized functions for specific cases. The results
from running the code examples in this box are not included, because they are
the same for all chunks. Here we are interested in the execution time, and we
leave this as an exercise.
# b <- numeric()
b <- numeric(length(a)-1) # pre-allocate memory
i <- 1
while (i < length(a)) {
b[i] <- a[i+1] - a[i]
print(b)
i <- i + 1
}
b
# b <- numeric()
b <- numeric(length(a)-1) # pre-allocate memory
for(i in seq(along.with = b)) {
b[i] <- a[i+1] - a[i]
print(b)
}
b
# or even better
b <- diff(a)
b
Execution time can be obtained with system.time(). For a vector of ten mil-
lion numbers, the for loop above takes 1.1 s and the equivalent while loop
2.0 s, the vectorized statement using indexing takes 0.2 s and function diff()
takes 0.1 s. The for loop without pre-allocation of memory to b takes 3.6 s, and
the equivalent while loop 4.7 s—i.e., the fastest execution time was more than
40 times faster than the slowest one. (Times for R 3.5.1 on my laptop under
Windows 10 x64.)
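The chunk that created the matrix A and summed its rows with two nested for
loops is missing from this extraction. A minimal reconstruction, consistent with
the printout below and with the discussion that follows (the names row.sums, i
and j are assumptions):
A <- matrix(1:50, nrow = 10)
row.sums <- numeric(nrow(A))
for (i in 1:nrow(A)) { # outer loop over rows
row.sums[i] <- 0
for (j in 1:ncol(A)) { # inner loop over columns
row.sums[i] <- row.sums[i] + A[i, j]
}
}
A
## [,1] [,2] [,3] [,4] [,5]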
## [1,] 1 11 21 31 41
## [2,] 2 12 22 32 42
## [3,] 3 13 23 33 43
## [4,] 4 14 24 34 44
## [5,] 5 15 25 35 45
## [6,] 6 16 26 36 46
## [7,] 7 17 27 37 47
## [8,] 8 18 28 38 48
## [9,] 9 19 29 39 49
## [10,] 10 20 30 40 50
The code above is very general: it will work with any two-dimensional matrix
with at least one column and one row. However, sometimes we need more specific
calculations. A[1, 2] selects one cell in the matrix, the one on the first row of the
second column. A[1, ] selects row one, and A[ , 2] selects column two. In the
example above, the value of i changes for each iteration of the outer loop. The
value of j changes for each iteration of the inner loop, and the inner loop is run in
full for each iteration of the outer loop. The inner loop index j changes fastest.
U 1) Modify the code in the example in the last chunk above so that
it sums the values only in the first three columns of A, 2) modify the same
example so that it sums the values only in the last three rows of A, 3) modify
the code so that it gracefully handles matrices with dimensions equal to zero
(as reported by ncol() and nrow()).
Will the code you wrote continue working as expected if the number of rows
in A changed? What if the number of columns in A changed, and the required
results still needed to be calculated for relative positions? What would happen
if A had fewer than three columns? Try to think first what to expect based
on the code you wrote. Then create matrices of different sizes and test your
code. After that, think how to improve the code, so that wrong results are not
produced.
If the total number of iterations is large and the code executed at each
iteration runs fast, the overhead added by the loop code can make a big con-
tribution to the total running time of a script. When dealing with nested loops,
as the inner loop is executed most frequently, this is the best place to look for
ways of reducing execution time. In this example, vectorization can be achieved
easily for the inner loop, as R has a function sum() which returns the sum of a
vector passed as its argument. Replacing the inner loop by an efficient function
can be expected to improve performance significantly.
A[i, ] selects row i and all columns. Reminder: in R the row index comes
first.
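A sketch of this replacement, keeping the outer loop and letting sum() do the
work of the inner one (a hedged example using the names assumed above):
row.sums <- numeric(nrow(A))
for (i in 1:nrow(A)) {
row.sums[i] <- sum(A[i, ]) # vectorized sum over all columns of row i
}
row.sums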
Both explicit loops can be eliminated if we use an apply function, such as
apply(), lapply() or sapply(), in place of the outer for loop. See section 3.4
below for details on the use of the different apply functions.
rowSums(A)
## [1] 105 110 115 120 125 130 135 140 145 150
U 1) How would you change this last example, so that only the last three
columns are added up? (Think about use of subscripts to select a part of the
matrix.) 2) To obtain column sums, one could modify the nested loops (think
how), transpose the matrix and use rowSums() (think how), or look up if there
is in R a function for this operation. A good place to start is with help(rowSums)
as similar functions may share the same help page, or at least be listed in the
“See also” section. Do try this, and explore other help pages in search of some
function you may find useful in the analysis of your own data.
3.3.5.1 Clean-up
Sometimes we need to make sure that clean-up code is executed even if the ex-
ecution of a script or function is aborted by the user or as a result of an error
condition. A typical example is a script that temporarily sets a disk folder as the
working directory or uses a file as temporary storage. Function on.exit() can be
used to record that a user-supplied expression needs to be executed when the
current function, or a script, exits. Function on.exit() can also make code easier to
read as it keeps creation and clean-up next to each other in the body of a function
or in the listing of a script.
file.create("temp.file")
## [1] TRUE
on.exit(file.remove("temp.file"))
# code that makes use of the file goes here
str(z)
## List of 6
## $ : num 4.77
## $ : num 4.72
## $ : num 4.06
## $ : num 3.93
## $ : num 3.98
## $ : num 3.38
We can see above that the computed results are the same in the three cases,
but the class and structure of the objects returned differ.
Anonymous functions can be defined on the fly and passed to FUN, allowing us
to re-write the examples above more concisely (only the second one shown).
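A sketch of that rewrite, assuming, consistently with the vectorized statement
below, that the function applied computes log(x) + 5 over a.vector:
z <- sapply(X = a.vector, FUN = function(x) {log(x) + 5})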
str(z)
z <- log(a.vector) + 5
str(z)
set.seed(123456)
str(a.list)
## List of 5
We define the function that we will apply, a function that returns a numeric
vector of length 2.
We next use vapply() to apply our function to each member vector of the
list.
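The two missing chunks can be sketched as follows, assuming the applied function
computes the mean and standard deviation, consistently with the printout below
(the name mean_and_sd is an assumption):
mean_and_sd <- function(x) {
c(mean = mean(x), sd = sd(x))
}
values <- vapply(a.list, FUN = mean_and_sd, FUN.VALUE = c(mean = 0, sd = 0))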
values
## [,1] [,2] [,3] [,4] [,5]
## mean 10.0725427 11.7254442 9.657997 10.5573814 10.605846
## sd 0.5428149 0.7844356 1.050663 0.6460881 1.005676
U Modify the example above so that it computes row means instead of col-
umn means.
U Look up the help pages for apply() and mean() and study them until you
understand how additional arguments can be passed to the applied function.
Can you guess why apply() was designed to have parameter names fully in
uppercase, something very unusual for R code style?
If we apply a function that returns a value of the same length as its input,
then the dimensions of the value returned by apply() are the same as those of its
input. We use, in the next examples, a “no-op” function that returns its argument
unchanged, so that input and output can be easily compared.
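The chunks themselves are missing here; a reconstruction consistent with the
printout of t(z) below (the values in a.small.matrix are recovered from that
printout):
a.small.matrix <- matrix(c(11.3, 10.6, 8.2, 10.4, 8.6, 11.0), ncol = 2)
z <- apply(a.small.matrix, MARGIN = 1, FUN = function(x) {x}) # "no-op" function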
class(z)
## [1] "matrix"
t(z)
## [,1] [,2]
## [1,] 11.3 10.4
## [2,] 10.6 8.6
## [3,] 8.2 11.0
A more realistic example, but difficult to grasp without seeing the toy exam-
ples shown above, is when we apply a function that returns a value of a different
length than its input, but longer than one. When we compute column summaries
(MARGIN = 2), a matrix is returned, with each column containing the summaries
for the corresponding column in the original matrix (a.small.matrix). In contrast,
when we compute row summaries (MARGIN = 1), each column in the returned ma-
trix contains the summaries for one row in the original array. What happens is
that by using apply() the dimension of the original matrix or array over which we
compute summaries “disappears.” Consequently, given how matrices are stored in
R, when columns collapse into a single value, the rows become columns. After this,
the vectors returned by the applied function are stored as rows.
str(z)
Apply functions vs. loop constructs Apply functions cannot always re-
place explicit loops as they are less flexible. A simple example is the accu-
mulation pattern, in which we “walk” through a collection while carrying over
a partial result between iterations. A similar case is a pattern where calculations are
done over a “window” that moves at each iteration. The simplest and probably
most frequent calculation of this kind is the calculation of differences between
successive members. Other examples are moving window summaries such as
a moving median (see page 104 for other alternatives to the use of explicit
iteration loops).
assign("a", 9.99)
a
## [1] 9.99
The two toy examples above do not demonstrate why one may want to use
assign(). Common situations where we may want to use character strings to store
(future or existing) object names are 1) when we allow users to provide names for
objects either interactively or as character data, 2) when in a loop we traverse
a vector or list of object names, or 3) we construct at runtime object names from
multiple character strings based on data or settings. A common case is when we
import data from a text file and we want to name the object according to the name
of the file on disk, or a character string read from the header at the top of the file.
Another case is when character values are the result of a computation.
for (i in 1:5) {
assign(paste("zz_", i, sep = ""), i^2)
}
ls(pattern = "zz_*")
## [1] "zz_1" "zz_2" "zz_3" "zz_4" "zz_5"
get("b")
## [1] 9.99
To close this chapter, I will mention some advanced aspects of the R language
that are useful when writing complex scripts—if you are going through the book
sequentially, you will want to return to this section after reading chapters 4 and 5.
In the same way as we can assign names to numeric, character and other types of
objects, we can assign names to functions and expressions. We can also create lists
of functions and/or expressions. The R language has a very consistent grammar,
with all lists and vectors behaving in the same way. The implication of this is that
we can assign different functions or expressions to a given name, and consequently
it is possible to write loops over lists of functions or expressions.
In this first example we use a character vector of function names, and use func-
tion do.call() as it accepts either character strings or function names as its first
argument. We obtain a numeric vector with named members with names matching
the function names.
x <- rnorm(10)
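# the statements computing 'results' are missing in this extraction;
# a minimal reconstruction, assuming do.call() is applied to each
# function name in a character vector:
results <- numeric()
for (f.name in c("mean", "max", "min")) {
results[f.name] <- do.call(f.name, args = list(x))
}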
results
## mean max min
## 0.5453427 2.5026454 -1.1139499
When traversing a list of functions in a loop, we face the problem that we cannot
access the original names of the functions as what is stored in the list are the
definitions of the functions. In this case, we can hold the function definitions in
the loop variable (f in the chunk below) and call the functions by use of the function
call notation (f()). We obtain a numeric vector with anonymous members.
results <- numeric()
funs <- list(mean, max, min)
for (f in funs) {
results <- c(results, f(x))
}
results
## [1] 0.5453427 2.5026454 -1.1139499
We can use a named list of functions to gain full control of the naming of the
results. We obtain a numeric vector with named members with names matching
the names given to the list members.
funs <- list(average = mean, maximum = max, minimum = min)
results <- numeric()
for (f in names(funs)) {
results[f] <- funs[[f]](x)
}
results
## average maximum minimum
## 0.5453427 2.5026454 -1.1139499
Next is an example using model formulas. We use a loop to fit three models, ob-
taining a list of fitted models. We cannot pass to anova() this list of fitted models,
as it expects each fitted model as a separate nameless argument to its … param-
eter. We can get around this problem using function do.call() to call anova().
Function do.call() passes the members of the list passed as its second argument
as individual arguments to the function being called, using their names if present.
anova() expects nameless arguments so we need to remove the names present in
results.
models <- list(linear = y ~ x,
linear.orig = y ~ x - 1,
quadratic = y ~ x + I(x^2))
results <- list()
for (m in names(models)) {
results[[m]] <- lm(models[[m]], data = my.data) # data frame assumed defined earlier
}
str(results, max.level = 1)
## List of 3
## $ linear :List of 12
## $ linear.orig:List of 12
## $ quadratic :List of 12
do.call(anova, unname(results))
##
## Model 1: y ~ x
## Model 2: y ~ x - 1
## Model 3: y ~ x + I(x^2)
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)
## 1      8 0.05525
## 2      9 2.31266 -1   -2.2574 306.19 4.901e-07 ***
## 3      7 0.05161  2    2.2611 153.34 1.660e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
If we had no further use for results we could simply build a list with nameless
members by using positional indexing.
str(results, max.level = 1)
## List of 3
## $ :List of 12
## $ :List of 12
## $ :List of 12
do.call(anova, results)
##
## Model 1: y ~ x
## Model 2: y ~ x - 1
## Model 3: y ~ x + I(x^2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 8 0.05525
## 2 9 2.31266 -1 -2.2574 306.19 4.901e-07 ***
## 3 7 0.05161 2 2.2611 153.34 1.660e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Richard W. Hamming
Numerical Methods for Scientists and Engineers, 1987
summaries (summary()). All these methods accept numeric vectors and matrices as an
argument. Some of them also have definitions for other classes such as data frames
in the case of summary(). (The R language does not define a function for calculation
of the standard error of the mean. Please, see section 5.3.1 on page 168 for how
to define your own.)
x <- 1:20
mean(x)
var(x)
sd(x)
median(x)
mad(x)
mode(x)
max(x)
min(x)
range(x)
quantile(x)
length(x)
summary(x)
By default, if the argument contains NAs these functions return NA. The logic
behind this is that if one value exists but is unknown, the true result of the com-
putation is also unknown (see page 25 for details on the role of NA in R). However, an
additional parameter called na.rm allows us to override this default behavior by
requesting any NA in the input to be removed (or discarded) before the calculation
proceeds.
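For example, with mean() (a minimal illustration):
x.na <- c(1, 2, NA, 4)
mean(x.na) # returns NA
mean(x.na, na.rm = TRUE) # returns 2.333333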
4.3 Distributions
Density, distribution functions, quantile functions and generation of pseudo-
random values for several different distributions are part of the R language. Enter-
ing help(Distributions) at the R prompt will open a help page describing all the
distributions available in base R. In what follows we use the Normal distribution
for the examples, but with slight differences in their parameters the functions for
Distributions 121
other theoretical distributions follow a consistent naming pattern. For each distri-
bution the different functions contain the same “root” in their names: norm for the
normal distribution, unif for the uniform distribution, and so on. The “head” of
the name indicates the type of values returned: “d” for density, “q” for quantile,
“r” for (pseudo-)random numbers, and “p” for probabilities.
pnorm(q = 4, mean = 0, sd = 1)
## [1] 0.9999683
We see above that in the case of a symmetric distribution like the Normal,
the quantiles in the two tails differ only in sign. This is not the case for asym-
metric distributions.
When calculating a 𝑝-value from a quantile in a test of significance, we need
to first decide whether a two-sided or single-sided test is relevant, and in the
case of a single-sided test, which tail is of interest. For a two-sided test we need
to multiply the returned value by 2.
pnorm(q = 4, mean = 0, sd = 1) * 2
## [1] 1.999937
Note that pnorm() returns, by default, the lower-tail probability; for a quantile
in the upper tail we need to pass lower.tail = FALSE (or equivalently use
1 - pnorm(q)), as otherwise doubling the returned value yields, as seen here, a
number larger than 1, which cannot be a probability.
rnorm(5)
## [1] "Z" "N" "Y" "R" "M" "E" "W" "J" "H" "G" "U" "O" "S" "T" "L" "F" "X" "P" "K"
## [1] "M" "S" "L" "R" "B" "D" "Q" "W" "V" "N" "J" "P"
## [1] "K" "E" "V" "N" "A" "Q" "L" "C" "T" "L" "H" "U"
4.5 Correlation
Both parametric (Pearson’s) and non-parametric robust (Spearman’s and Kendall’s)
methods for the estimation of the (linear) correlation between pairs of variables
are available in base R. The different methods are selected by passing arguments
to a single function. While Pearson’s method is based on the actual values of the
observations, non-parametric methods are based on the ordering or rank of the
observations, and are consequently less affected by observations with extreme values.
We first load and explore the data set cars from R which we will use in the
example. These data consist of stopping distances for cars moving at different
speeds, as described in the documentation available by entering help(cars).
data(cars)
plot(cars)
(Scatter plot of dist against speed for the 50 observations in cars.)
4.5.1 Pearson’s 𝑟
Function cor() can be called with two vectors of the same length as arguments.
In the case of the parametric Pearson method, we do not need to provide further
arguments as this method is the default one.
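For example (the returned value agrees with the matrix printed further below):
cor(x = cars$speed, y = cars$dist)
## [1] 0.8068949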
It is also possible to pass a data frame (or a matrix) as the only argument.
126 The R language: Statistics
When the data frame (or matrix) contains only two columns, the returned value is
equivalent to that of passing the two columns individually as vectors.
cor(cars)
## speed dist
## speed 1.0000000 0.8068949
## dist 0.8068949 1.0000000
When the data frame or matrix contains more than two numeric vectors, the
returned value is a matrix of estimates of pairwise correlations between columns.
We here use rnorm() described above to create a long vector of pseudo-random
values drawn from the Normal distribution and matrix() to convert it into a matrix
with three columns (see page 51 for details about R matrices).
While cor() returns an estimate of 𝑟, the correlation coefficient, cor.test()
also computes the 𝑡-value, 𝑝-value, and confidence interval for the estimate.
cor.test(x = cars$speed, y = cars$dist)
##
## Pearson's product-moment correlation
##
## data: cars$speed and cars$dist
## t = 9.464, df = 48, p-value = 1.49e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6816422 0.8862036
## sample estimates:
## cor
## 0.8068949
As described below for model fitting and 𝑡-test, cor.test() also accepts a
formula plus data as arguments.
U Functions cor() and cor.test() return R objects, that when using R inter-
actively get automatically “printed” on the screen. One should be aware that
print() methods do not necessarily display all the information contained in an
R object. This is almost always the case for complex objects like those returned
by R functions implementing statistical tests. As with any R object, we can save
the returned object into a variable and explore its structure.
a <- cor(cars)
class(a)
attributes(a)
str(a)
Methods class(), attributes() and str() are very powerful tools that can
be used when we are in doubt about the data contained in an object and/or how
it is structured. Knowing the structure allows us to retrieve the data members
directly from the object when predefined extractor methods are not available.
Function cor.test(), described above, also allows the choice of method with
the same syntax as shown for cor().
U Repeat the exercise in the playground immediately above, but now using
non-parametric methods. How does the information stored in the returned
matrix differ depending on the method, and how can we extract information
about the method used for calculation of the correlation from the returned
object?
4.6.1 Regression
In the example immediately below, speed is a continuous numeric variable. In the
ANOVA table calculated for the model fit, in this case a linear regression, we can
see that the term for speed has only one degree of freedom (df).
In the next example we continue using the stopping distance for cars data set
included in R. Please see the plot on page 125.
data(cars)
is.factor(cars$speed)
## [1] FALSE
is.numeric(cars$speed)
## [1] TRUE
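The call fitting the model is missing from this extraction; it necessarily matches
the formula displayed later by summary():
fm1 <- lm(dist ~ 1 + speed, data = cars)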
class(fm1)
## [1] "lm"
The next step is diagnosis of the fit. Are assumptions of the linear model pro-
cedure used reasonably close to being fulfilled? In R it is most common to use
plots to this end. We show here only one of the four plots normally produced. This
quantile vs. quantile plot allows us to assess how much the residuals deviate from
being normally distributed.
plot(fm1, which = 2)
(Normal Q-Q plot of standardized residuals against theoretical quantiles for
lm(dist ~ 1 + speed); observations 23, 35 and 49 stand out as extreme.)
In the case of a regression, calling summary() with the fitted model object as
argument is most useful as it provides a table of coefficient estimates and their
errors. Remember that as is the case for most R functions, the value returned by
summary() is printed when we call this method at the R prompt.
summary(fm1)
##
## Call:
## lm(formula = dist ~ 1 + speed, data = cars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -29.069  -9.525  -2.272   9.215  43.201
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
## F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
Let’s look at the printout of the summary, section by section. Under “Call:” we
find dist ~ 1 + speed, the specification of the model fitted, plus the data used.
Under “Residuals:” we find the extremes, quartiles and median of the residuals, or
deviations between observations and the fitted line. Under “Coefficients:” we find
the estimates of the model parameters and their variation plus corresponding 𝑡-
tests. At the end of the summary there is information on degrees of freedom and
overall coefficient of determination (𝑅²).
If we return to the model formulation, we can now replace 𝛼 and 𝛽 by the
130 The R language: Statistics
estimates obtaining 𝑦 = −17.6 + 3.93𝑥. Given the nature of the problem, we know
based on first principles that stopping distance must be zero when speed is zero.
This suggests that we should not estimate the value of 𝛼 but instead set 𝛼 = 0, or
in other words, fit the model 𝑦 = 𝛽 ⋅ 𝑥.
However, in R models, the intercept is always implicitly included, so the model
fitted above can be formulated as dist ~ speed—i.e., a missing + 1 does not
change the model. To exclude the intercept from the previous model, we need to
specify it as dist ~ speed - 1, resulting in the fitting of a straight line passing
through the origin (𝑥 = 0, 𝑦 = 0).
fm2 <- lm(dist ~ speed - 1, data = cars)
summary(fm2)
##
## Call:
## lm(formula = dist ~ speed - 1, data = cars)
##
## Residuals:
##
## Coefficients:
##       Estimate Std. Error t value Pr(>|t|)
## speed   2.9091     0.1414   20.58   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
Now there is no estimate for the intercept in the summary, only an estimate for
the slope.
plot(fm2, which = 1)
(Residuals vs. fitted values plot for lm(dist ~ speed - 1); observations 23, 35
and 49 are labeled as extreme.)
The equation of the second fitted model is 𝑦 = 2.91𝑥, and from the residuals, it
can be seen that it is inadequate, as the straight line does not follow the curvature
of the relationship between dist and speed.
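The chunk defining fm3 is missing here; from the discussion of quadratic
polynomials that follows, it was a fit like:
fm3 <- lm(dist ~ speed + I(speed^2), data = cars)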
plot(fm3, which = 3)
summary(fm3)
anova(fm3)
The “same” fit using an orthogonal polynomial can be specified using func-
tion poly(). Polynomials of different degrees can be obtained by supplying
as the second argument to poly() the corresponding positive integer value.
In this case, the different terms of the polynomial are bulked together in the
summary.
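A sketch of the equivalent fit with an orthogonal polynomial:
fm3a <- lm(dist ~ poly(speed, 2), data = cars)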
summary(fm3a)
anova(fm3a)
We can also compare two model fits using anova(), to test whether one of
the models describes the data better than the other. It is important in this case
to take into consideration the nature of the difference between the model for-
mulas, most importantly if they can be interpreted as nested—i.e., interpreted
as a base model vs. the same model with additional terms.
anova(fm2, fm1)
We can use different criteria to choose the “best” model: significance based
on 𝑝-values or information criteria (AIC, BIC). AIC (Akaike’s “An Informa-
tion Criterion”) and BIC (“Bayesian Information Criterion” = SBC, “Schwarz’s
Bayesian criterion”) both penalize the resulting “goodness” based on the num-
ber of parameters in the fitted model. In the case of AIC and BIC, a smaller
value is better, and values returned can be either positive or negative, in which
case more negative is better. Estimates of both AIC and BIC are returned by
functions AIC() and BIC(), and for some classes of model fits also by anova().
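A sketch of how the information criteria can be requested for the fits above:
AIC(fm1, fm2, fm3, fm3a)
BIC(fm1, fm2, fm3, fm3a)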
Once you have run the code in the chunks above, you will be able see that
these three criteria do not necessarily agree on which is the “best” model. Find
in the output 𝑝-value, BIC and AIC estimates, for the different models and
conclude which model is favored by each of the three criteria. In addition you
will notice that the two different formulations of the quadratic polynomial are
equivalent.
The objects returned by model fitting functions are rather complex and
contain the full information, including the data to which the model was fit.
The different functions described above, either extract parts of the object or
do additional calculations and formatting based on them. There are different
specializations of these methods which are called depending on the class of
the model-fit object. (See section 5.4 on page 172.)
class(fm1)
## [1] "lm"
str(anova(fm1))
## Classes 'anova' and 'data.frame': 2 obs. of 5 variables:
## $ Df : int 1 48
## $ Sum Sq : num 21185 11354
## $ Mean Sq: num 21185 237
## $ F value: num 89.6 NA
## $ Pr(>F) : num 1.49e-12 NA
str(summary(fm1))
## List of 11
## $ residuals : Named num [1:50] 3.85 11.85 -5.95 12.05 2.12 ...
## ..- attr(*, "names")= chr [1:50] "1" "2" "3" "4" ...
## $ coefficients : num [1:2, 1:4] -17.579 3.932 6.758 0.416 -2.601 ...
Once we know the structure of the object and the names of members, we
can simply extract them using the usual R rules for member extraction.
summary(fm1)$adj.r.squared
## [1] 0.6438102
We show below how to do the equivalent test with a null hypothesis of slope = 1. The
procedure is applicable to any constant value as a null hypothesis for any of
the fitted parameter estimates, for hypotheses set a priori. The examples use
a two-sided test. In some cases, a single-sided test should be used (e.g., if it is
known a priori that deviation is, for physical reasons, possible only in
one direction away from the null hypothesis, or because only one direction of
response is of interest).
To estimate the t-value we need an estimate for the parameter and an esti-
mate of the standard error for this estimate and its degrees of freedom.
The t-test is based on the difference between the value of the null hypothesis
and the estimate.
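The statements extracting the estimates are missing here; they can be
reconstructed from the structure of the summary (the variable names match
those used in the code below):
est.slope.value <- summary(fm1)$coefficients["speed", "Estimate"]
est.slope.se <- summary(fm1)$coefficients["speed", "Std. Error"]
degrees.of.freedom <- fm1$df.residual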
hyp.null <- 1
t.value <- (est.slope.value - hyp.null) / est.slope.se
p.value <- 2 * pt(abs(t.value), df = degrees.of.freedom, lower.tail = FALSE) # pt(), not dt(), gives tail probabilities
U Check that the procedure above agrees with the output of summary()
when we set hyp.null <- 0 instead of hyp.null <- 1.
Modify the example so as to test whether the intercept is significantly larger
than 5 feet, doing a one-sided test.
Method predict() uses the fitted model together with new data for the indepen-
dent variables to compute predictions. As predict() accepts new data as input, it
allows interpolation and extrapolation to values of the independent variables not
present in the original data. In the case of fits of linear- and some other models,
method predict() returns, in addition to the prediction, estimates of the confi-
dence and/or prediction intervals. The new data must be stored in a data frame
with columns using the same names for the explanatory variables as in the data
used for the fit, a response variable is not needed and additional columns are
ignored. (The explanatory variables in the new data can be either continuous or
factors, but they must match in this respect those in the original data.)
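A minimal sketch of such a call (the speeds used are arbitrary):
predict(fm1, newdata = data.frame(speed = c(10, 20)), interval = "confidence")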
U Predict using both fm1 and fm2 the distance required to stop cars
moving at 0, 5, 10, 20, 30, and 40 mph. Study the help page for the predict
method for linear models (using help(predict.lm)). Explore the difference be-
tween "prediction" and "confidence" bands: why are they so different?
data(InsectSprays)
is.numeric(InsectSprays$spray)
## [1] FALSE
is.factor(InsectSprays$spray)
## [1] TRUE
levels(InsectSprays$spray)
## [1] "A" "B" "C" "D" "E" "F"
We fit the model in exactly the same way as for linear regression; the difference
is that we use a factor as the explanatory variable. By using a factor instead of a
numeric vector, a different model matrix is built from an equivalent formula.
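The call fitting the model is missing here; it matches lm(count ~ spray) shown
in the diagnostic plot below:
fm4 <- lm(count ~ spray, data = InsectSprays)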
Diagnostic plots are obtained in the same way as for linear regression.
plot(fm4, which = 3)
(Scale-Location diagnostic plot for lm(count ~ spray): square root of absolute
standardized residuals against fitted values, with observations 8, 69 and 70
labeled.)
In ANOVA we are mainly interested in testing hypotheses, and anova() pro-
vides the most interesting output. Function summary() can be used to extract pa-
rameter estimates. The default contrasts and corresponding 𝑝-values returned by
summary() test hypotheses that have little or no direct interest in an analysis of
variance. Function aov() is a wrapper on lm() that returns an object that by de-
fault when printed displays the output of anova().
anova(fm4)
## Analysis of Variance Table
##
## Response: count
## Df Sum Sq Mean Sq F value Pr(>F)
## spray 5 2668.8 533.77 34.702 < 2.2e-16 ***
## Residuals 66 1015.2 15.38
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The defaults used for model fits and ANOVA calculations vary among
programs. There exist different so-called “types” of sums of squares, usually
called I, II, and III. In orthogonal designs the choice has no consequences, but
differences can be important for unbalanced designs, even leading to different
conclusions. R’s default, type I, is usually considered to suffer milder problems
than type III, the default used by SPSS and SAS.
The contrasts used affect the estimates returned by coef() and summary()
applied to an ANOVA model fit. The default used in R is different to that used
in some other programs (even different than in S). The most straightforward
way of setting a different default for a whole series of model fits is by setting
R option contrasts, which we here only print.
options("contrasts")
## $contrasts
## unordered ordered
## "contr.treatment" "contr.poly"
It is also possible to select the contrast to be used in the call to aov() or lm().
The default, contr.treatment, uses the first level of the factor (assumed to be a
control) as reference for estimation of coefficients and their significance, while
contr.sum uses as reference the mean of all levels, by using as condition that
the sum of the coefficient estimates is equal to zero. Obviously this changes
what the coefficients describe, and consequently also the estimated 𝑝-values.
Interpretation of any analysis has to take into account these differences and
users should not be surprised if ANOVA yields different results in base R and
SPSS or SAS given the different types of sums of squares used. The interpreta-
tion of ANOVA on designs that are not orthogonal will depend on which type
is used, so the different results are not necessarily contradictory even when
different.
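The two fits whose summaries are compared below are missing from this
extraction; a reconstruction consistent with the text (the names fm4trea and
fm4sum are taken from the calls below):
fm4trea <- lm(count ~ spray, data = InsectSprays,
contrasts = list(spray = contr.treatment))
fm4sum <- lm(count ~ spray, data = InsectSprays,
contrasts = list(spray = contr.sum))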
summary(fm4trea)
##
## Call:
##
## Residuals:
##
## Coefficients:
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
summary(fm4sum)
##
## Call:
##
## Residuals:
##
## Coefficients:
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
Contrasts always affect the parameter estimates, independently of whether
the experiment design is orthogonal or not. A different
set of contrasts simply tests a different set of possible treatment effects. Con-
trasts, on the other hand, do not affect the table returned by anova() as this
table does not deal with the effects of individual factor levels.
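The chunk fitting fm10 is missing here; a reconstruction assuming a
quasi-Poisson error distribution, a choice consistent with counts as response
and with the F-test requested below (the family is an assumption):
fm10 <- glm(count ~ spray, data = InsectSprays,
family = quasipoisson(link = "log"))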
anova(fm10)
## Analysis of Deviance Table
##
## Response: count
##
## Terms added sequentially (first to last)
##
##       Df Deviance Resid. Df Resid. Dev
## NULL                     71     409.04
## spray  5   310.71        66      98.33
The printout from the anova() method for GLM fits has some differences to that
for LM fits. By default, no significance test is computed, as a knowledgeable choice
is required depending on the characteristics of the model and data. We here use
"F" as an argument to request an 𝐹-test.
anova(fm10, test = "F")
## Analysis of Deviance Table
##
## Response: count
##
## Terms added sequentially (first to last)
##
##       Df Deviance Resid. Df Resid. Dev      F    Pr(>F)
## NULL                     71     409.04
## spray  5   310.71        66      98.33 41.216 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(fm10, which = 3)
(Scale-Location diagnostic plot for glm(count ~ spray): square root of absolute
standardized Pearson residuals against predicted values, with observations 27,
39 and 69 labeled.)
We can extract different components similarly as described for linear models
(see section 4.6 on page 127).
class(fm10)
## [1] "glm" "lm"
summary(fm10)
##
## Call:
##
## Deviance Residuals:
##
## Coefficients:
head(residuals(fm10))
## 1 2 3 4 5 6
## -1.2524891 -2.1919537 1.3650439 -0.1320721 -0.1320721 -0.6768988
head(fitted(fm10))
## 1 2 3 4 5 6
## 14.5 14.5 14.5 14.5 14.5 14.5
If we use str() or names() we can see that there are some differences
with respect to linear model fits. The returned object is of a different class and
contains some members not present in linear models. Two of these have to do
with the iterative approximation method used, iter contains the number of
iterations used and converged the success or not in finding a solution.
names(fm10)
## [1] "coefficients" "residuals" "fitted.values"
## [4] "effects" "R" "rank"
## [7] "qr" "family" "linear.predictors"
## [10] "deviance" "aic" "null.deviance"
## [13] "iter" "weights" "prior.weights"
## [16] "df.residual" "df.null" "y"
## [19] "converged" "boundary" "model"
## [22] "call" "formula" "terms"
## [25] "data" "offset" "control"
## [28] "method" "contrasts" "xlevels"
fm10$converged
## [1] TRUE
fm10$iter
## [1] 5
A model is linear or non-linear in its parameters, i.e., in the quantities estimated when
fitting the model to data. This is different from the shape of the function
when plotted—i.e., polynomials of any degree are linear models. In contrast, the
Michaelis-Menten equation used in chemistry and the Gompertz equation used to
describe growth are non-linear models in their parameters.
While analytical algorithms exist for finding estimates for the parameters of
linear models, in the case of non-linear models, the estimates are obtained by ap-
proximation. For analytical solutions, estimates can always be obtained, except in
infrequent pathological cases where reliance on floating point numbers with lim-
ited resolution introduces rounding errors that “break” mathematical algorithms
that are valid for real numbers. For approximations obtained through iteration,
cases when the algorithm fails to converge onto an answer are relatively common.
Iterative algorithms attempt to improve an initial guess for the values of the pa-
rameters to be estimated, a guess frequently supplied by the user. In each iteration
the estimate obtained in the previous iteration is used as the starting value, and
this process is repeated one time after another. The expectation is that after a fi-
nite number of iterations the algorithm will converge on a solution that “cannot”
be improved further. In real life we stop iteration when the improvement in the
fit is smaller than a certain threshold, or when no convergence has been achieved
after a certain maximum number of iterations. In the first case, we usually obtain
good estimates; in the second case, we do not obtain usable estimates and need to
look for different ways of obtaining them. When convergence fails, the first thing
to do is to try different starting values and if this also fails, switch to a different
computational algorithm. These steps usually help, but not always. Good starting
values are in many cases crucial and in some cases “guesses” can be obtained using
either graphical or analytical approximations.
For functions for which computational algorithms exist for “guessing” suit-
able starting values, R provides a mechanism for packaging the function to be
fitted together with the function generating the starting values. These functions
go by the name of self-starting functions and relieve the user from the burden of
guessing and supplying suitable starting values. The self-starting functions avail-
able in R are SSasymp(), SSasympOff(), SSasympOrig(), SSbiexp(), SSfol(), SSfpl(),
SSgompertz(), SSlogis(), SSmicmen(), and SSweibull(). Function selfStart() can
be used to define new ones. All these functions can be used when fitting models
with nls or nlme. Please, check the respective help pages for details.
In the case of nls() the specification of the model to be fitted differs from that
used for linear models. We will use as an example fitting the Michaelis-Menten
equation describing reaction kinetics in biochemistry and chemistry. The mathe-
matical formulation, using the names in the R code below, is given by
rate = Vm ⋅ conc / (K + conc)
where Vm is the maximum reaction rate and K the Michaelis constant, i.e., the
concentration at which the rate is Vm/2.
data(Puromycin)
names(Puromycin)
## [1] "conc" "rate" "state"
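The fit itself is missing from this extraction; a reconstruction consistent with
the formula and the 12 observations reflected in the summary below (restricting
the data to the "treated" runs is an assumption):
fm21 <- nls(rate ~ SSmicmen(conc, Vm, K),
data = Puromycin,
subset = state == "treated")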
class(fm21)
## [1] "nls"
summary(fm21)
##
## Formula: rate ~ SSmicmen(conc, Vm, K)
##
## Parameters:
## Estimate Std. Error t value Pr(>|t|)
## Vm 2.127e+02 6.947e+00 30.615 3.24e-11 ***
## K 6.412e-02 8.281e-03 7.743 1.57e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.93 on 10 degrees of freedom
##
## Number of iterations to convergence: 0
## Achieved convergence tolerance: 1.937e-06
residuals(fm21)
## [1] 25.4339970 -3.5660030 -5.8109606 4.1890394 -11.3616076 4.6383924
## [7] -5.6846886 -12.6846886 0.1670799 10.1670799 6.0311724 -0.9688276
## attr(,"label")
## [1] "Residuals"
fitted(fm21)
## attr(,"label")
If we use str() or names() we can see that there are differences with
respect to linear model and generalized model fits. The returned object is of
class nls and contains some new members and lacks others. Two members
are related to the iterative approximation method used, control containing
nested members holding iteration settings, and convInfo (convergence infor-
mation) with nested members with information on the outcome of the iterative
algorithm.
str(fm21, max.level = 1)
## List of 6
## $ m :List of 16
## $ convInfo :List of 5
## $ control :List of 5
fm21$convInfo
## $isConv
## [1] TRUE
##
## $finIter
## [1] 0
##
## $finTol
## [1] 1.937028e-06
##
## $stopCode
## [1] 0
##
## $stopMessage
## [1] "converged"
A call to a function such as log() will not be interpreted as part of the model
formula, and consequently does not require additional protection; neither does
the expression passed as its argument.
y ~ I(x1 + x2)
y ~ log(x1 + x2)
R formula syntax allows alternative ways for specifying interaction terms. They
allow “abbreviated” ways of entering formulas, which for complex experimental
designs saves typing and can improve clarity. As seen above, operator * saves us
from having to explicitly indicate all the interaction terms in a full factorial model.
y ~ x1 * x2 * x3
When the model to be specified does not include all possible interaction terms,
we can combine the concise notation with parentheses.
y ~ x1 + (x2 * x3)
y ~ x1 + x2 + x3 + x2:x3
That the two model formulas above are equivalent can be seen using terms().
terms(y ~ x1 + (x2 * x3))
## y ~ x1 + (x2 * x3)
## attr(,"variables")
## list(y, x1, x2, x3)
## attr(,"factors")
##    x1 x2 x3 x2:x3
## y   0  0  0     0
## x1  1  0  0     0
## x2  0  1  0     1
## x3  0  0  1     1
## attr(,"term.labels")
## [1] "x1" "x2" "x3" "x2:x3"
## attr(,"order")
## [1] 1 1 1 2
## attr(,"intercept")
## [1] 1
## attr(,"response")
## [1] 1
## attr(,".Environment")
## <environment: R_GlobalEnv>
y ~ x1 * (x2 + x3)
y ~ x1 + x2 + x3 + x1:x2 + x1:x3
Similarly, for the second pair of formulas:
terms(y ~ x1 * (x2 + x3))
## y ~ x1 * (x2 + x3)
## attr(,"variables")
## list(y, x1, x2, x3)
## attr(,"factors")
## x1 x2 x3 x1:x2 x1:x3
## y 0 0 0 0 0
## x1 1 0 0 1 1
## x2 0 1 0 1 0
## x3 0 0 1 0 1
## attr(,"term.labels")
## [1] "x1" "x2" "x3" "x1:x2" "x1:x3"
## attr(,"order")
## [1] 1 1 1 2 2
## attr(,"intercept")
## [1] 1
## attr(,"response")
## [1] 1
## attr(,".Environment")
## <environment: R_GlobalEnv>
The ^ operator provides a concise notation to limit the order of the interaction
terms included in a formula.
y ~ (x1 + x2 + x3)^2
y ~ x1 + x2 + x3 + x1:x2 + x1:x3 + x2:x3
y ~ (x1 + x2 + x3)^2
y ~ (x1 * x2 * x3)^2
Operator %in% can also be used as a shortcut for including only some of all the
possible interaction terms in a formula.
y ~ x1 + x2 + x1 %in% x2
U Execute the examples below using the npk data set from R. They demon-
strate the use of different model formulas in ANOVA. Use these examples plus
your own variations on the same theme to build your understanding of the
syntax of model formulas. Based on the terms displayed in the ANOVA tables,
first work out what models are being fitted in each case. In a second step, write
each of the models using a mathematical formulation. Finally, think how model
choice may affect the conclusions from an analysis of variance.
data(npk)
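The example chunks referred to are missing from this extraction; sketches of
the kind of models meant (the particular formulas are assumptions):
anova(lm(yield ~ N * P * K, data = npk))
anova(lm(yield ~ N + P + K, data = npk))
anova(lm(yield ~ (N + P + K)^2, data = npk))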
The model formula syntax is used, with various extensions, by contributed
packages, many of them specific to individual packages. These extensions fall
outside the scope of this book.
R will accept any syntactically correct model formula, even when the re-
sults of the fit are not interpretable. It is the responsibility of the user to en-
sure that models are meaningful. The most common, and dangerous, mistake
is specifying for factorial experiments, models that are missing lower-order
interactions.
Fitting models like those below to data from an experiment based on a
three-way factorial design should be avoided. In both cases simpler terms are
missing, while higher-order interaction(s) that include the missing term are in-
cluded in the model. Such models are not interpretable, as the variation from
the missing term(s) ends up being “disguised” within the remaining terms, dis-
torting their apparent significance and parameter estimates.
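Sketches of the kind of uninterpretable model formulas warned against
(assumed, as the original chunks are missing here):
y ~ x1 + x1:x2:x3
y ~ x1 + x2 + x1:x2:x3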
In contrast to those above, the models below are interpretable, even if not
“full” models (not including all possible interactions).
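For example (again assumed reconstructions):
y ~ x1 + x2 + x3
y ~ (x1 + x2 + x3)^2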
class(y ~ x)
## [1] "formula"
a <- y ~ x
class(a)
## [1] "formula"
There is no method is.formula() in base R, but we can easily test the class of
an object with inherits().
inherits(a, "formula")
## [1] TRUE
Model formulas, as any other R objects, can be saved in variables, including in lists. Why is this use-
ful? For example, if we want to fit several different models to the same data,
we can write a for loop that walks through a list of model formulas. Or we can
write a function that accepts one or more formulas as arguments.
The use of for loops for iteration over a list of model formulas is described
in section 3.6 on page 115.
my.data <- data.frame(x = 1:10, y = (1:10) / 2 + rnorm(10))
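The chunk creating the list of fits is missing here; a minimal reconstruction,
assuming a loop over a list of model formulas (the particular formulas are
assumptions):
formulas <- list(y ~ x, y ~ x + I(x^2), y ~ x + I(x^2) + I(x^3))
anovas <- list()
for (f in formulas) {
anovas <- c(anovas, list(lm(f, data = my.data)))
}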
str(anovas, max.level = 1)
## List of 3
## $ :List of 12
## $ :List of 12
## $ :List of 12
##
## Call:
##
## Coefficients:
## (Intercept) x
## 1.4059 0.2839
As there are many functions for the manipulation of character strings avail-
able in base R and through extension packages, it is straightforward to build
model formulas programmatically as strings. We can use functions like paste()
to assemble a formula as text, and then use as.formula() to convert it to an
object of class formula, usable for fitting a model.
my.string <- paste("y", "x", sep = "~")
lm(as.formula(my.string), data = my.data)
##
## Call:
## lm(formula = as.formula(my.string), data = my.data)
##
## Coefficients:
## (Intercept) x
## 1.4059 0.2839
formatted.string <- sprintf("%s ~ %s", "y", "x") # assumed; the statement creating it is missing
formatted.string
## [1] "y ~ x"
as.formula(formatted.string)
## y ~ x
It is also possible to edit formula objects with method update(). In the re-
placement formula, a dot can stand for either the left-hand side (lhs) or the
right-hand side (rhs) of the existing formula. We can also remove terms, as can
be seen below. In some cases the dot corresponding to the lhs can be omitted,
but including it makes the syntax clearer.
my.formula <- y ~ x1 + x2
update(my.formula, . ~ . + x3)
## y ~ x1 + x2 + x3
update(my.formula, . ~ . - x1)
## y ~ x2
update(my.formula, . ~ x3)
## y ~ x3
update(my.formula, z ~ .)
## z ~ x1 + x2
update(my.formula, . + z ~ .)
## y + z ~ x1 + x2
options("contrasts")
## $contrasts
## unordered ordered
## "contr.treatment" "contr.poly"
A model matrix for a model for a two-way factorial design with no interac-
tion term:
model.matrix(~ A + B, treats.df)
## (Intercept) Ayes Bwhite
## 1 1 1 1
## 2 1 1 0
## 3 1 1 1
## 4 1 1 0
## 5 1 0 1
## 6 1 0 0
## 7 1 0 1
## 8 1 0 0
## attr(,"assign")
## [1] 0 1 2
## attr(,"contrasts")
## attr(,"contrasts")$A
## [1] "contr.treatment"
##
## attr(,"contrasts")$B
## [1] "contr.treatment"
A model matrix for a model for a two-way factorial design with interaction
term:
model.matrix(~ A * B, treats.df)
## (Intercept) Ayes Bwhite Ayes:Bwhite
## 1 1 1 1 1
## 2 1 1 0 0
## 3 1 1 1 1
## 4 1 1 0 0
## 5 1 0 1 0
## 6 1 0 0 0
## 7 1 0 1 0
## 8 1 0 0 0
## attr(,"assign")
## [1] 0 1 2 3
## attr(,"contrasts")
## attr(,"contrasts")$A
## [1] "contr.treatment"
##
## attr(,"contrasts")$B
## [1] "contr.treatment"
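The chunk creating my.ts is missing here; a minimal example of building a time
series with ts() (the values and dates are assumptions):
my.ts <- ts(rnorm(100), start = c(2000, 1), frequency = 12)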
class(my.ts)
## [1] "ts"
str(my.ts)
We next use the data set austres, included in R, containing data on the number
of Australian residents.
class(austres)
## [1] "ts"
is.ts(austres)
## [1] TRUE
plot(austres)
(Plot of the austres time series: number of residents, increasing from about
13000 to over 16000, against Time.)
A different example, using data set nottem containing meteorological data for
Nottingham, shows a clear cyclic component. The annual cycle of mean air tem-
peratures (in degrees Fahrenheit) is clear when data are plotted.
data(nottem)
is.ts(nottem)
## [1] TRUE
plot(nottem)
(Plot of the nottem time series: mean air temperature in degrees Fahrenheit,
cycling between about 30 and 60, against Time.)
In the next two code chunks, two different approaches to time series decom-
position are used. In the first one we use a moving average to capture the trend,
while in the second approach we use Loess (a smooth curve fitted by local weighted
regression) for the decomposition, a method for which the acronym STL (Seasonal
and Trend decomposition using Loess) is used. Before decomposing the time-
series we reexpress the temperatures in degrees Celsius.
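Sketches of the two decompositions described, assuming the chunks missing here
converted the units and then used decompose() and stl() (the object name
nottem.stl is confirmed by the code further below; nottem.celsius is an
assumption):
nottem.celsius <- (nottem - 32) * 5/9 # Fahrenheit to Celsius
nottem.decomposed <- decompose(nottem.celsius) # moving-average based
nottem.stl <- stl(nottem.celsius, s.window = "periodic") # Loess-based STL
plot(nottem.stl)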
(Plot of the STL decomposition of the temperature series: panels for the data,
seasonal, trend and remainder components against time, 1920 to 1940.)
class(nottem.stl)
str(nottem.stl)
data(iris)
mmf1 <- lm(cbind(Petal.Length, Petal.Width) ~ Species, data = iris)
anova(mmf1)
## Analysis of Variance Table
##
##            Df Pillai approx F num Df den Df    Pr(>F)
## Species     2 1.0465   80.661      4    294 < 2.2e-16 ***
## Residuals 147
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(mmf1)
## Response Petal.Length :
##
## Call:
##
## Residuals:
##
## Coefficients:
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
##
## Response Petal.Width :
##
## Call:
##
## Residuals:
##
## Coefficients:
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
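The chunk fitting mmf2 is missing here; from the exercise below, it used
manova() with the same formula (an assumed reconstruction):
mmf2 <- manova(cbind(Petal.Length, Petal.Width) ~ Species, data = iris)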
summary(mmf2)
## Df Pillai approx F num Df den Df Pr(>F)
## Species 2 1.0465 80.661 4 294 < 2.2e-16 ***
## Residuals 147
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
U Modify the example above to use aov() instead of manova() and save
the result to a variable named mmf3. Use class(), attributes(), names(), str()
and extraction of members to explore objects mmf1, mmf2 and mmf3. Are they
different?
pc <- prcomp(iris[, -5], # the four numeric variables of iris; first line reconstructed
center = TRUE,
scale = TRUE)
By printing the returned object we can see the loadings of each variable on the
principal components PC1 to PC4.
class(pc)
## [1] "prcomp"
pc
## Standard deviations (1, .., p=4):
## [1] 1.7083611 0.9560494 0.3830886 0.1439265
##
## Rotation (n x k) = (4 x 4):
## PC1 PC2 PC3 PC4
## Sepal.Length 0.5210659 -0.37741762 0.7195664 0.2612863
## Sepal.Width -0.2693474 -0.92329566 -0.2443818 -0.1235096
## Petal.Length 0.5804131 -0.02449161 -0.1421264 -0.8014492
## Petal.Width 0.5648565 -0.06694199 -0.6342727 0.5235971
summary(pc)
## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 1.7084 0.9560 0.38309 0.14393
## Proportion of Variance 0.7296 0.2285 0.03669 0.00518
## Cumulative Proportion 0.7296 0.9581 0.99482 1.00000
Method biplot() produces a plot with one principal component (PC) on each
axis, plus arrows for the loadings.
biplot(pc)
(Biplot of PC2 against PC1: observations plotted as row numbers, with arrows
showing the loadings of Sepal.Length, Sepal.Width, Petal.Length and
Petal.Width.)
plot(pc)
(Bar plot, titled pc, of the variances of the principal components.)
Visually more elaborate plots of the principal components and their loadings
can be obtained using package ‘ggplot2’ described in chapter 7 starting on page
203. Packages ‘ggfortify’ and ‘ggbiplot’ extend ‘ggplot2’ so as to make it easy
to plot principal components and their loadings.
str(pc, max.level = 1)
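The chunk computing the metric multidimensional scaling is missing here; from
the plot title and references below it was presumably:
loc <- cmdscale(eurodist)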
We can see that the returned object loc is a matrix, with names for one of the
dimensions.
class(loc)
## [1] "matrix" "array"
dim(loc)
## [1] 21 2
dimnames(loc)
## [[1]]
## [1] "Athens" "Barcelona" "Brussels" "Calais"
## [5] "Cherbourg" "Cologne" "Copenhagen" "Geneva"
## [9] "Gibraltar" "Hamburg" "Hook of Holland" "Lisbon"
## [13] "Lyons" "Madrid" "Marseilles" "Milan"
## [17] "Munich" "Paris" "Rome" "Stockholm"
## [21] "Vienna"
##
## [[2]]
## NULL
head(loc)
## [,1] [,2]
## Athens 2290.27468 1798.8029
## Barcelona -825.38279 546.8115
## Brussels 59.18334 -367.0814
## Calais -82.84597 -429.9147
## Cherbourg -352.49943 -290.9084
## Cologne 293.68963 -405.3119
To make the code easier to read, two vectors are first extracted from the matrix
and named x and y. We force aspect to equality so that distances on both axes are
comparable.
x <- loc[, 1]
y <- -loc[, 2] # sign changed so that North is at the top (assumed)
plot(x, y, type = "n", asp = 1,
main = "cmdscale(eurodist)")
text(x, y, rownames(loc), cex = 0.6)
(Scatter plot of the MDS solution, titled cmdscale(eurodist): the 21 cities are
arranged so that distances on the plot approximate the road distances, with
Stockholm at the top and Athens and Gibraltar near the bottom.)
U Find data on the mean annual temperature, mean annual rainfall and
mean number of sunny days at each of the locations in the eurodist data set.
Next, compute suitable distance metrics, for example, using function dist. Fi-
nally, use MDS to visualize how similar the locations are with respect to each
of the three variables. Devise a measure of distance that takes into account the
three climate variables and use MDS to find how distant the different locations
are.
hc <- hclust(eurodist) # reconstructed: the call is echoed in the printout below
print(hc)
##
## Call:
## hclust(d = eurodist)
##
## Cluster method : complete
## Number of objects: 21
plot(hc)
(Cluster dendrogram for eurodist, hclust (*, "complete"): Height from 0 to
about 4000, with the 21 cities as leaves.)
cutree(hc, k = 5)
## Athens Barcelona Brussels Calais Cherbourg
## 1 2 3 3 3
## Cologne Copenhagen Geneva Gibraltar Hamburg
## 3 4 2 5 4
## Hook of Holland Lisbon Lyons Madrid Marseilles
## 3 5 2 5 2
## Milan Munich Paris Rome Stockholm
## 2 3 3 1 4
## Vienna
## 3
The object returned by hclust() contains details of the result of the clustering,
which allows further manipulation and plotting.
str(hc)
## List of 7
## $ height : num [1:20] 158 172 269 280 328 428 460 460 521 668 ...
## $ dist.method: NULL
## - attr(*, "class")= chr "hclust"
5.2 Packages
5.2.1 Sharing of R-language extensions
The most elegant way of adding new features or capabilities to R is through pack-
ages. This is without doubt the best mechanism when these extensions to R need
to be shared. However, in most situations it is also the best mechanism for man-
aging code that will be reused even by a single person over time. R packages have
strict rules about their contents, file structure, and documentation, which makes
it possible among other things for the package documentation to be merged into
R’s help system when a package is loaded. With a few exceptions, packages can be
written so that they will work on any computer where R runs.
Packages can be shared as source or binary package files, sent for example
through e-mail. However, for sharing packages widely, it is best to submit them
to a public repository such as CRAN.
Only objects exported by a package that has been attached are visible outside
its own namespace. Loading and attaching a package with library() makes the
exported objects available. Attaching a package adds the objects exported by the
package to the search path so that they can be accessed without prepending the
name of the namespace. Most packages do not export all the functions and objects
defined in their code; some are kept internal, in most cases because they may
change or be removed in future versions. Package namespaces can be detached
and also unloaded with function detach() using a slightly different notation for
the argument from that which we described for data frames in section 2.14.1 on
page 71.
U Use help to look up the help pages for install.packages() and library(),
and explain what the code in the next chunk does.
R packages can be installed either from sources, or from already built “bina-
ries”. Installing from sources, depending on the package, may require additional
software to be available. Under MS-Windows, the needed shell, commands and com-
pilers are not available as part of the operating system. Installing them is not dif-
ficult as they are available prepackaged in installers (you will need RTools, and
MiKTEX). It is easier to install packages from binary .zip files under MS-Windows.
Under Linux most tools will be available, or very easy to install, so it is usual to
install packages from sources. For OS X (Apple Mac) the situation is somewhere
in-between. If the tools are available, packages can be very easily installed from
sources from within RStudio. However, binaries are for most packages also readily
available.
In many areas of data analysis, the use of contributed packages can be unavoidable. Even in such
cases, it is not unusual to have alternatives to choose from within the available
contributed packages. Sometimes groups or suites of packages are designed to
work well together.
The CRAN repository has very broad scope and includes a section called
“views.” R views are web pages providing annotated lists of packages frequently
used within a given field of research, engineering or specific applications. These
views are edited and updated by different editors. They can be found at https:
//cran.r-project.org/web/views/.
The Bioconductor repository specializes in bioinformatics with R. It also has
a section with “views” and within it, descriptions of different data analysis work-
flows. The workflows are especially good as they reveal which sets of packages
work well together. These views can be found at https://www.bioconductor.
org/packages/release/BiocViews.html.
Although ROpenSci does not keep a separate package repository for the peer-
reviewed packages, they do keep an index of them at https://ropensci.org/
packages/.
The CRAN repository keeps an archive of earlier versions of packages, on an
individual package basis. METACRAN (https://www.r-pkg.org/) is an archive of
repositories that keeps a historical record as snapshots from CRAN. METACRAN
uses a different search engine than CRAN itself, making it easier to search the
whole repository.
When no existing function does what we need, we can define a new function in its place. We have been calling R functions or operators in almost ev-
ery example in this book; what we will next tackle is how to define new functions
of our own.
New functions and operators are defined using function function(), and saved
like any other object in R by assignment to a variable name. In the example below,
x and y are both formal parameters, or names used within the function for objects
that will be supplied as arguments when the function is called. One can think of
parameter names as placeholders for actual values to be supplied as arguments
when calling the function.
The definitions in the chunks below are a sketch, reconstructed to be consistent with the output shown.

my.prod <- function(x, y){x * y}
my.prod(4, 3)
## [1] 12

An assignment to a formal parameter within the body of a function has no effect on the argument passed in the call.

my.change <- function(x){x <- NA}
a <- 1
my.change(a)
a
## [1] 1
In general, any result that needs to be made available outside the function
must be returned by the function—or explicitly assigned to an object in the
enclosing environment (i.e., using <<- or assign()) as a side effect.
A function can only return a single object, so when multiple results are
produced they need to be collected into a single object. In many cases, lists
are used to collect all the values to be returned into one R object. For example,
model fit functions like lm(), discussed in section 4.6 on page 127, return a
complex list with heterogeneous named members.
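As a minimal sketch (our own example, not from the original text), a function returning several related values collected into a named list:

range.stats <- function(x) {
  # collect all the results into one R object so a single value is returned
  list(min = min(x), max = max(x), span = max(x) - min(x))
}
range.stats(c(3, 1, 4, 1, 5))
## $min
## [1] 1
##
## $max
## [1] 5
##
## $span
## [1] 4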
print.x.1("test")
## [1] "test"
print.x.2("test")
## [1] "test"
## [1] "test"
print.x.3("test")
## [1] "test"
print.x.4("test")
## NULL
print.x.4("test")
## NULL
Functions have their own scope. Any names created by normal assignment
within the body of a function are visible only within the body of the function and
disappear when the function returns from the call. In normal use, functions in R
do not affect their environment through side effects. They receive input through
arguments and return a value as the result of the call. This value can be either
printed or assigned as we have seen when using functions earlier.
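A minimal sketch of this behavior (our own example; it assumes that no object named z exists in the workspace):

f <- function() {
  z <- 3   # visible only within the body of f()
  z^2
}
f()
## [1] 9
exists("z")
## [1] FALSE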
The definitions and data below are reconstructed to be consistent with the output shown.

SEM <- function(x){sqrt(var(x)/length(x))}
a <- c(1, 2, 3, -5)
a.na <- c(a, NA)
SEM(a)
## [1] 1.796988
SEM(a.na)
## [1] NA

For vectors containing NAs we can add a formal parameter controlling their removal; this is the definition printed later in this section.

sem <- function(x, na.omit = FALSE) {
  if (na.omit) {
    x <- na.omit(x)
  }
  sqrt(var(x)/length(x))
}
sem(x = a)
## [1] 1.796988
sem(x = a.na)
## [1] NA
sem(x = a.na, na.omit = TRUE)
## [1] 1.796988
R does not provide a function for standard error, so the function above is gener-
ally useful. Its user interface is consistent with that of functionally similar existing
functions. We have added a new word to the R vocabulary available to us.
In the definition of sem() we set a default argument for parameter na.omit which
is used unless the user explicitly passes an argument to this parameter.
U Define your own function to calculate the mean in a similar way as SEM()
was defined above. Hint: function sum() could be of help.
Functions can have much more complex and larger compound statements as
their body than those in the examples above. Within an expression, a function
name followed by parentheses is interpreted as a call to the function. The bare
name of a function instead gives access to its definition.
We first print (implicitly) the definition of our function from earlier in this sec-
tion.
sem
## function(x, na.omit = FALSE) {
##   if (na.omit) {
##     x <- na.omit(x)
##   }
##   sqrt(var(x)/length(x))
## }
## <bytecode: 0x000000001c7dcd30>
Next we print the definition of R’s linear model fitting function lm(). (Use of
lm() is described in section 4.6 on page 127.)
(The listing is long; only its start and end are shown here, with elisions marked by "...".)

lm
## function (formula, data, subset, weights, na.action, method = "qr",
##     model = TRUE, x = FALSE, y = FALSE, qr = TRUE, singular.ok = TRUE,
##     contrasts = NULL, offset, ...)
## {
##     ret.x <- x
##     ret.y <- y
##     cl <- match.call()
##     ...
##     if (method == "model.frame")
##         return(mf)
##     ...
##     z$call <- cl
##     z$terms <- mt
##     if (model)
##         z$model <- mf
##     if (ret.x)
##         z$x <- x
##     if (ret.y)
##         z$y <- y
##     if (!qr)
##         z$qr <- NULL
##     z
## }
## <bytecode: 0x00000000151d2418>
## <environment: namespace:stats>
As can be seen at the end of the listing, this function written in the R language
has been byte-compiled so that it executes faster. Functions that are part of the R
language, but that are not coded using the R language, are called primitives and
their full definition cannot be accessed through their name (cf. sem() defined above).
list
## function (...)  .Primitive("list")
5.3.2 Operators
Operators are functions that use a different syntax for being called. If their name
is enclosed in back ticks they can be called as ordinary functions. Binary operators
like + have two formal parameters, and unary operators like unary - have only one
formal parameter. The parameters of many binary R operators are named e1 and
e2.
1 / 2
## [1] 0.5
`/`(1 , 2)
## [1] 0.5
`/`(e1 = 1 , e2 = 2)
## [1] 0.5
`/`
## function (e1, e2)  .Primitive("/")
New operators are defined like ordinary functions, with the special name enclosed in back ticks; the definition below is reconstructed from the printed output that follows.

`%-mean%` <- function(e1, e2) {
  e1 - mean(e2)
}

To print the definition, we enclose the name of our new operator in back ticks—i.e., we back quote the special name.

`%-mean%`
## function(e1, e2) {
##   e1 - mean(e2)
## }
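A quick check of the new operator: its value is the left operand minus the mean of the right operand.

1:10 %-mean% 1:10
##  [1] -4.5 -3.5 -2.5 -1.5 -0.5  0.5  1.5  2.5  3.5  4.5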
tors,” they are like “nouns” in natural language. What we obtain with classes is the
possibility of defining multiple versions of functions (or methods) sharing the same
name but tailored to operate on objects belonging to different classes. We have al-
ready been using methods with multiple specializations throughout the book, for
example plot() and summary().
We start with a quotation from S Poetry (Burns 1998, page 13).
We say that specific methods are dispatched based on the class of the argument
passed. This, together with the loose type checks of R, allows writing code that
functions as expected on different types of objects, e.g., character and numeric
vectors.
R has good support for the object-oriented programming paradigm, but as a
system that has evolved over the years, currently R supports multiple approaches.
The oldest and still most popular approach is called S3; a more recent and more powerful approach, with slower performance, is called S4. The general idea is that a name like
“plot” can be used as a generic name, and that the specific version of plot() called
depends on the arguments of the call. Using computing terms we could say that
the generic of plot() dispatches the original call to different specific versions of
plot() based on the class of the arguments passed. S3 generic functions dispatch,
by default, based only on the argument passed to a single parameter, the first
one. S4 generic functions can dispatch the call based on the arguments passed
to more than one parameter and the structure of the objects of a given class is
known to the interpreter. In S3, the specializations of a generic are recognized only by their name, and the class of an object is given by a character string stored as an attribute of the object.
We first explore one of the methods already available in R. The definition of
mean shows that it is the generic for a method.
mean
## function (x, ...)
## UseMethod("mean")
## <bytecode: 0x00000000138ddc60>
## <environment: namespace:base>
We can find out which specializations of a method are available in the current search path using methods().
methods(mean)
## [1] mean.Date mean.default mean.difftime mean.POSIXct mean.POSIXlt
## see '?methods' for accessing help and source code
We can also use methods() to query all methods, including operators, defined
for objects of a given class.
methods(class = "list")
## [1] all.equal as.data.frame coerce Ops relist
## [6] type.convert within
## see '?methods' for accessing help and source code
a <- 123
class(a)
## [1] "numeric"
class(a) <- "myclass"
class(a)
## [1] "myclass"
Once a specialized method exists for a class, it will be used for objects of this class. The method definition below is a minimal sketch, consistent with the output shown.

print.myclass <- function(x, ...) {
  print(paste("[myclass]", format(unclass(x))))
}
print(a)
## [1] "[myclass] 123"
print(as.numeric(a))
## [1] 123
UseMethod("my_print", x)
print(class(x))
print(x, ...)
my_print(123)
## [1] "numeric"
## [1] 123
my_print("abc")
## [1] "character"
## [1] "abc"
A specialization for data frames can print only selected rows (the definition is reconstructed; the rows parameter matches the calls below).

my_print.data.frame <- function(x, rows = 1:5, ...) {
  print(x[rows, ], ...)
  invisible(x)
}
We add the second statement so that the function invisibly returns the
whole data frame, rather than the lines printed. We now do a quick test of
the function.
my_print(cars)
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
my_print(cars, 8:10)
## speed dist
## 8 10 26
## 9 10 34
## 10 11 17
b <- my_print(cars)
## speed dist
## 1 4 2
## 2 4 10
## 3 7 4
## 4 7 22
## 5 8 16
str(b)
## 'data.frame': 50 obs. of  2 variables:
##  $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
##  $ dist : num  2 10 4 22 16 10 18 26 34 17 ...
pi
## [1] 3.141593
pi <- "apple pie"   # value assumed for the example; masks base R's pi
pi
## [1] "apple pie"
rm(pi)
pi
## [1] 3.141593
exists("pi")
## [1] TRUE
In the example above, the two variables are not defined in the same scope. In the
example below we assign a new value to a variable we have earlier created within
the same scope, and consequently the second assignment overwrites, rather than
hides, the existing definition.
my.pie <- "raspberry pie"   # values assumed for the example
my.pie
## [1] "raspberry pie"
my.pie <- "apple pie"
my.pie
## [1] "apple pie"
rm(my.pie)
exists("my.pie")
## [1] FALSE
Patrick J. Burns
S Poetry, 1998
6.2 Introduction
By reading previous chapters, you have already become familiar with base R
classes, methods, functions and operators for storing and manipulating data. Most
of these were originally designed to perform optimally on rather small data
sets (see Matloff 2011). The R implementation has been improved over the years
looked up. There is no easy answer; a simplified syntax leads to ambiguity, and
a fully specified syntax is verbose. Recent versions of the package introduced a
terse syntax to achieve a concise way of specifying where to look up names. My
opinion is that, for code that needs to be highly reliable and to produce reproducible results in the future, we should for the time being prefer base R. For code that is to be used once, or for which reproducibility can depend on the use of a specific (old or soon-to-be-old) version of 'dplyr', or which is not a burden to update, the conciseness and power of the new syntax will be an advantage.
In this chapter you will become familiar with alternative “grammars of data” as
implemented in some of the packages that enable new approaches to manipulating
data in R. As in previous chapters I will focus more on the available tools and how
to use them than on their role in the analysis of data. The books R for Data Science
(Wickham and Grolemund 2017) and R Programming for Data Science (Peng 2016)
partly cover the same subjects from the perspective of data analysis.
install.packages(learnrbook::pkgs_ch_data)
To run the examples included in this chapter, you need first to load some pack-
ages from the library (see section 5.2 on page 163 for details on the use of pack-
ages).
library(learnrbook)
library(tibble)
library(magrittr)
library(wrapr)
library(stringr)
library(dplyr)
library(tidyr)
library(lubridate)
6.4.2 ‘tibble’
The authors of package ‘tibble’ describe their tbl class as backwards compatible
with data.frame and make it a derived class. This backwards compatibility is only
partial so in some situations data frames and tibbles are not equivalent.
The class and methods that package ‘tibble’ defines lift some of the restric-
tions imposed by the design of base R data frames at the cost of creating some
incompatibilities due to changed (improved) syntax for member extraction and by
adding support for “columns” of class list and removing support for columns of
class matrix. Handling of attributes is also different, with no row names added by
default. There are also differences in default behavior of both constructors and
methods. Although objects of class tbl can be passed as arguments to functions that expect data frames as input, these functions are not guaranteed to work correctly as a result of the differences in syntax.
It is easy to write code that will work correctly both with data frames and
tibbles. However, code that is syntactically correct according to the R language
may fail if a tibble is used in place of a data frame.
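One such difference (a minimal sketch, using our own objects): extraction with [ , j] returns a vector from a data frame but a one-column tibble from a tibble.

df <- data.frame(a = 1:3)
tb <- tibble(a = 1:3)
df[ , "a"]
## [1] 1 2 3
tb[ , "a"]
## # A tibble: 3 x 1
##       a
##   <int>
## 1     1
## 2     2
## 3     3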
The print() method for tibbles differs from that for data frames in that it outputs a header with the text "A tibble:" followed by the dimensions (number of rows × number of columns), adds under each column name an abbreviation of its class and, instead of printing all rows and columns, displays a limited number of them. In addition, individual values are formatted differently, even adding color highlighting for negative numbers.
The default number of rows printed can be set with options(), which we set here to only three rows for most of this chapter.
options(tibble.print_max = 3, tibble.print_min = 3)
= In their first incarnation, the name for tibble was data_frame (with a dash
instead of a dot). The old name is still recognized, but its use should be avoided
and tibble() used instead. One should be aware that although the constructor
tibble() and conversion function as_tibble(), as well as the test is_tibble()
use the name tibble, the class attribute is named tbl.
is_tibble(my.tb)
## [1] TRUE
inherits(my.tb, "tibble")
## [1] FALSE
class(my.tb)
## [1] "tbl_df"     "tbl"        "data.frame"
We start with the constructor and conversion methods. For this we will define our own diagnosis function (apply functions are described in section 3.4 on page 108); its definition is a sketch, reconstructed from the fragments remaining in the source.

show_classes <- function(x) {
  cat(paste(class(x)[1], "containing:"),
      paste(names(x), sapply(x, class), sep = ": ", collapse = ", "),
      sep = "\n")
}
In the next two chunks we can see some of the differences. The tibble()
constructor does not by default convert character data into factors, while the
data.frame() constructor does.
my.df <- data.frame(codes = c("A", "B", "C"), numbers = 1:3, integers = 1L:3L)
is.data.frame(my.df)
## [1] TRUE
is_tibble(my.df)
## [1] FALSE
show_classes(my.df)
## data.frame containing:
## codes: factor, numbers: integer, integers: integer
Tibbles are data frames—or more formally class tibble is derived from class
data.frame. However, data frames are not tibbles.
my.tb <- tibble(codes = c("A", "B", "C"), numbers = 1:3, integers = 1L:3L)
is.data.frame(my.tb)
## [1] TRUE
is_tibble(my.tb)
## [1] TRUE
show_classes(my.tb)
## tbl_df containing:
## codes: character, numbers: integer, integers: integer
The print() method for tibbles overrides the one defined for data frames.
print(my.df)
## codes numbers integers
## 1 A 1 1
## 2 B 2 2
## 3 C 3 3
print(my.tb)
## # A tibble: 3 x 3
## codes numbers integers
## <chr> <int> <int>
## 1 A 1 1
## 2 B 2 2
## 3 C 3 3
U Tibbles and data frames differ in how they are printed when they have
many rows or columns. 1) Construct a data frame and an equivalent tibble with
at least 50 rows and then test how the output looks when they are printed. 2)
Construct a data frame and an equivalent tibble with more columns than will
fit in the width of the R console and then test how the output looks when they
are printed.
We can convert between the two classes (the conversion calls are a sketch; they were lost in the source).

my_conv.tb <- as_tibble(my.df)
is.data.frame(my_conv.tb)
## [1] TRUE
is_tibble(my_conv.tb)
## [1] TRUE
show_classes(my_conv.tb)
## tbl_df containing:
## codes: factor, numbers: integer, integers: integer

my_conv.df <- as.data.frame(my.tb)
is.data.frame(my_conv.df)
## [1] TRUE
is_tibble(my_conv.df)
## [1] FALSE
show_classes(my_conv.df)
## data.frame containing:
## codes: character, numbers: integer, integers: integer
U Look carefully at the result of the conversions. Why do we now have a data
frame with A as character and a tibble with A as a factor?
Class and member-wise comparisons show where the objects differ.

class(my.tb)
## [1] "tbl_df"     "tbl"        "data.frame"
class(my_conv.df)
## [1] "data.frame"
my.tb == my_conv.df
##      codes numbers integers
## [1,]  TRUE    TRUE     TRUE
## [2,]  TRUE    TRUE     TRUE
## [3,]  TRUE    TRUE     TRUE
identical(my.tb, my_conv.df)
## [1] FALSE
Now we derive a new class from a tibble by prepending a class name, and then attempt a conversion back into a tibble (the chunk is a sketch, consistent with the comparisons below).

my.xtb <- my.tb
class(my.xtb) <- c("xtb", class(my.xtb))
class(my.xtb)
## [1] "xtb"        "tbl_df"     "tbl"        "data.frame"
my_conv_x.tb <- as_tibble(my.xtb)
class(my_conv_x.tb)
## [1] "tbl_df"     "tbl"        "data.frame"
my.xtb == my_conv_x.tb
##      codes numbers integers
## [1,]  TRUE    TRUE     TRUE
## [2,]  TRUE    TRUE     TRUE
## [3,]  TRUE    TRUE     TRUE
identical(my.xtb, my_conv_x.tb)
## [1] FALSE
While data frame columns can be factors, vectors or matrices (with the same number of rows as the data frame), columns of tibbles can be factors, vectors or lists (with the same number of members as the tibble has rows). The two constructor calls below are a sketch, reconstructed to match the printed tibbles.

tibble(a = 1:5, b = 5:1, c = list("a", 1, 2, 3, 4))
## # A tibble: 5 x 3
##       a     b c
##   <int> <int> <list>
## 1     1     5 <chr [1]>
## 2     2     4 <dbl [1]>
## 3     3     3 <dbl [1]>
## # ... with 2 more rows

tibble(a = 1:5, b = 5:1, c = list("a", 1:2, 1:4, 1:8, 1:16))
## # A tibble: 5 x 3
##       a     b c
##   <int> <int> <list>
## 1     1     5 <chr [1]>
## 2     2     4 <int [2]>
## 3     3     3 <int [4]>
## # ... with 2 more rows
How can pipes exist within a single R script? When chaining functions into a
pipe, data is passed between them through temporary R objects stored in memory,
which are created and destroyed automatically. Conceptually there is little differ-
ence between Unix shell pipes and pipes in R scripts, but the implementations are
different.
What do pipes achieve in R scripts? They relieve the user from the responsibility
of creating and deleting the temporary objects and of enforcing the sequential
execution of the different steps. Pipes usually improve readability of scripts by
allowing more concise code.
Currently, two main implementations of pipes are available as R extensions, in
packages ‘magrittr’ and ‘wrapr’.
6.5.1 ‘magrittr’
One set of operators needed to build pipes of R functions is implemented in pack-
age ‘magrittr’. This implementation is used in the ‘tidyverse’ and the pipe operator
re-exported by package ‘dplyr’.
We start with a toy example, first written using separate steps and normal R syntax; only the first line of the chunk survived in the source, so the remaining steps are a sketch.

data.in <- 1:10
data.tmp <- sqrt(data.in)   # intermediate result stored in a temporary object
data.out <- sum(data.tmp)
rm(data.tmp)                # tidy up the temporary object
data.out
## [1] 22.46828
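The same computation written as a 'magrittr' pipe needs no explicit temporary object (the variable name data1.out is ours):

data.in %>% sqrt() %>% sum() -> data1.out
all.equal(data.out, data1.out)
## [1] TRUE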
The %>% from package ‘magrittr’ takes two operands. The value returned
by the lhs (left-hand side) operand, which can be any R expression, is passed
as first argument to the rhs operand, which must be a function accepting at
least one argument. Consequently, in this implementation, the function in the
rhs must have a suitable signature for the pipe to work implicitly as usually
used. However, it is possible to pass piped arguments to a function by name
or to other parameters than the first one using a dot (.) as placeholder.
Some base R functions like subset() have a signature that is suitable for
use in ‘magrittr’ pipes using implicit passing of the piped value to the first ar-
gument, while others such as assign() will not. In such cases we can use . as
a placeholder and pass it as an argument, or, alternatively, define a wrapper
function to change the order of the formal parameters in the function signa-
ture.
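A minimal sketch of the dot placeholder in use (our own example, not from the original text): because the dot appears as a named argument, the piped value is not also inserted as the first argument.

mtcars %>% lm(mpg ~ disp, data = .) %>% coef()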
6.5.2 ‘wrapr’
The %.>%, or “dot-pipe” operator from package ‘wrapr’, allows expressions both on
the rhs and lhs, and enforces the use of the dot (.), as placeholder for the piped
object.
Rewritten using the dot-pipe operator, the pipe in the previous chunk becomes
However, the same code can use the pipe operator from ‘magrittr’.
If needed or desired, named arguments are supported with the dot-pipe operator, resulting in the expected behavior (the assign() call is a sketch, consistent with the checks shown).

data.in %.>% assign(x = "data3.out", value = .)
all.equal(data.in, data3.out)
## [1] TRUE
In contrast, the 'magrittr' pipe operator silently and unexpectedly fails to create the variable in the same example.

data.in %>% assign(x = "data4.out", value = .)
exists("data4.out")
## [1] FALSE
Under-the-hood, the implementations of %>% and %.>% are very different, with
%.>% usually having better performance.
In the rest of the book we will exclusively use dot pipes in examples, as they are easier to understand: they avoid implicit ("invisible") passing of arguments and impose fewer restrictions on the syntax that can be used.
Although pipes can make scripts visually very different from the use of assign-
ments of intermediate results to variables, from the point of view of data analysis
what makes pipes most convenient to use are some of the new classes, functions,
and methods defined in ‘tidyr’, ‘dplyr’, and other packages from the ‘tidyverse’.
Function gather() converts data from wide form into long form (or "tidy") data. We first convert the iris data frame into a tibble (this assignment is a sketch; it was lost in the source) and then use gather() to obtain a long-form tibble. By comparing iris.tb with long_iris.tb we can appreciate how gather() reshaped its input.

iris.tb <- as_tibble(iris)
head(iris.tb, 2)
## # A tibble: 2 x 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
long_iris.tb <- gather(iris.tb, key = part, value = dimension, -Species)

The same operation written as a dot pipe (the gather() call is reconstructed; the column names match those used later in this chapter):

iris.tb %.>%
  gather(., key = part, value = dimension, -Species) -> long_iris.tb_2
long_iris.tb_2
## # A tibble: 600 x 3
##   Species part         dimension
##   <fct>   <chr>            <dbl>
## 1 setosa  Sepal.Length       5.1
## 2 setosa  Sepal.Length       4.9
## 3 setosa  Sepal.Length       4.7
## # ... with 597 more rows
This syntax has been recently subject to debate and led to John Mount de-
veloping package ‘seplyr’ which provides wrappers on functions and meth-
ods from ‘dplyr’ that respect standard evaluation (SE). At the time of writing,
‘seplyr’ can be considered as experimental.
For the reverse operation, converting from long form to wide form, we use
spread().
Starting from version 1.0.0 of ‘tidyr’, gather() and spread() are depre-
cated and replaced by pivot_longer() and pivot_wider(). These new functions
use a different syntax but are not yet fully stable.
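As a sketch of the newer syntax, under 'tidyr' (>= 1.0.0) the gather() call used above can be written as:

pivot_longer(iris.tb, cols = -Species,
             names_to = "part", values_to = "dimension")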
The first advantage a user of the ‘dplyr’ functions and methods sees is
the completeness of the set of operations supported and the symmetry and
consistency among the different functions. A second advantage is that almost
all the functions are defined not only for objects of class tibble, but also for
objects of class data.table (package 'dtplyr') and for SQL databases ('dbplyr'),
with consistent syntax (see also section 8.14 on page 325). A further variant
exists in package ‘seplyr’, supporting a different syntax stemming from the
use of “standard evaluation” (SE) instead of non-standard evaluation (NSE). A
downside of ‘dplyr’ and much of the ‘tidyverse’ is that the syntax is not yet fully
stable. Additionally, some function and method names either override those in
base R or clash with names used in other packages. R itself is extremely stable
and expected to remain forward and backward compatible for a long time. For
code intended to remain in use for years, the fewer packages it depends on,
the less maintenance it will need. When using the ‘tidyverse’ we need to be
prepared to revise our own dependent code after any major revision to the
‘tidyverse’ packages we may use.
= A new package, ‘poorman’, implements many of the same words and gram-
mar as ‘dplyr’ using pure R in the implementation instead of compiled C++
and C code. This light-weight approach could be useful when dealing with rel-
atively small data sets or when the use of R’s data frames instead of tibbles is
preferred.
A useful feature of the tibble() constructor is that, within the call, later columns can be computed from earlier ones.

tibble(a = 1:5, b = 2 * a)
## # A tibble: 5 x 2
## a b
## <int> <dbl>
## 1 1 2
## 2 2 4
## 3 3 6
## # ... with 2 more rows
Continuing with the example from the previous section, we most likely would
like to split the values in variable part into plant_part and part_dim. We use
mutate() from ‘dplyr’ and str_extract() from ‘stringr’. We use regular expres-
sions as arguments passed to pattern. We do not show it here, but mutate() can
be used with variables of any mode, and calculations can involve values from sev-
eral columns. It is even possible to operate on values applying a lag or, in other
words, using rows displaced relative to the current one.
long_iris.tb %.>%
mutate(.,
plant_part = str_extract(part, "^[:alpha:]*"),
part_dim = str_extract(part, "[:alpha:]*$")) -> long_iris.tb
long_iris.tb
## # A tibble: 600 x 5
## Species part dimension plant_part part_dim
## <fct> <chr> <dbl> <chr> <chr>
## 1 setosa Sepal.Length 5.1 Sepal Length
## 2 setosa Sepal.Length 4.9 Sepal Length
## 3 setosa Sepal.Length 4.7 Sepal Length
## # ... with 597 more rows
In the next few chunks, we print the returned values rather than saving them
in variables. In normal use, one would combine these functions into a pipe using
operator %.>% (see section 6.5 on page 187).
Function arrange() is used for sorting the rows—it makes sorting a data frame or tibble simpler than using sort() and order(). Here we sort the tibble long_iris.tb based on the values in three of its columns (the chunk below is a sketch; the columns chosen are an assumption):
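arrange(long_iris.tb, Species, plant_part, part_dim)
## # A tibble: 600 x 5
##   Species part         dimension plant_part part_dim
##   <fct>   <chr>            <dbl> <chr>      <chr>
## 1 setosa  Petal.Length       1.4 Petal      Length
## 2 setosa  Petal.Length       1.4 Petal      Length
## 3 setosa  Petal.Length       1.3 Petal      Length
## # ... with 597 more rows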
select(iris.tb, -starts_with("Sepal"))
## # A tibble: 150 x 3
## Petal.Length Petal.Width Species
## <dbl> <dbl> <fct>
## 1 1.4 0.2 setosa
## 2 1.4 0.2 setosa
## 3 1.3 0.2 setosa
## # ... with 147 more rows
Not being aware of this can lead to erroneous and surprising results from calculations. Do not save grouped tibbles or data frames, and always make sure that inputs and outputs, at the head and tail of a pipe, are not grouped, by using ungroup() when needed.
The first step is to use group_by() to “tag” a tibble with the grouping. We create
a tibble and then convert it into a grouped tibble. Once we have a grouped tibble,
function summarise() will recognize the grouping and use it when the summary
values are calculated.
# chunk reconstructed to match the summary printed below
tibble(numbers = 1:9, letters = rep(letters[1:3], 3)) %.>%
  group_by(., letters) %.>%
  summarise(.,
            mean_numbers = mean(numbers),
            median_numbers = median(numbers),
            n = n())
## # A tibble: 3 x 4
## letters mean_numbers median_numbers n
## * <chr> <dbl> <int> <int>
## 1 a 4 4 3
## 2 b 5 5 3
## 3 c 6 6 3
How is grouping implemented for data frames and tibbles? In our case as
our tibble belongs to class tbl_df, grouping adds grouped_df as the most
derived class. It also adds several attributes with the grouping information in a
format suitable for fast selection of group members. To demonstrate this, we
need to make an exception to our recommendation above and save a grouped
tibble to a variable.
# chunk reconstructed; outputs are consistent with recent 'dplyr' versions
my.tb <- tibble(numbers = 1:9, letters = rep(letters[1:3], 3))
is.grouped_df(my.tb)
## [1] FALSE
class(my.tb)
## [1] "tbl_df"     "tbl"        "data.frame"
names(attributes(my.tb))
## [1] "names"     "row.names" "class"
my_gr.tb <- group_by(my.tb, letters)
is.grouped_df(my_gr.tb)
## [1] TRUE
class(my_gr.tb)
## [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"
names(attributes(my_gr.tb))
## [1] "names"     "row.names" "groups"    "class"
setdiff(attributes(my_gr.tb), attributes(my.tb))
## [[1]]
## # A tibble: 3 x 2
##   letters .rows
## * <chr>   <list<int>>
## 1 a       [3]
## 2 b       [3]
## 3 c       [3]
##
## [[2]]
## [1] "grouped_df" "tbl_df"     "tbl"        "data.frame"
my_ugr.tb <- ungroup(my_gr.tb)
class(my_ugr.tb)
## [1] "tbl_df"     "tbl"        "data.frame"
names(attributes(my_ugr.tb))
## [1] "names"     "row.names" "class"
all(my.tb == my_gr.tb)
## [1] TRUE
all(my.tb == my_ugr.tb)
## [1] TRUE
identical(my.tb, my_gr.tb)
## [1] FALSE
identical(my.tb, my_ugr.tb)
## [1] TRUE
The tests above show that the members are in all cases the same: operator == tests for equality at each position in the tibble but ignores the attributes. The attributes, including class, differ between normal tibbles and grouped ones, and so the objects are not identical.
If we replace tibble by data.frame in the first statement, and rerun the
chunk, the result of the last statement in the chunk is FALSE instead of TRUE.
At the time of writing, starting with a data.frame object, applying grouping
with group_by() followed by ungrouping with ungroup() has the side effect
of converting the data frame into a tibble. This is something to be very much
aware of, as there are differences in how the extraction operator [ , ] behaves
in the two cases. The safe way to write code making use of functions from
‘dplyr’ and ‘tidyr’ is to always use tibbles instead of data frames.
6.7.3 Joins
Joins allow us to combine two data sources which share some variables. Vari-
ables in common are used to match the corresponding rows before “joining” vari-
ables (i.e., columns) from both sources together. There are several join functions
in ‘dplyr’. They differ mainly in how they handle rows that do not have a match
between data sources.
We create here some artificial data to demonstrate the use of these functions.
We will create two small tibbles, with one column in common and one mismatched
row in each.
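The two tibbles would have been created with something like the chunk below (a sketch; the values are inferred from the printed results):

first.tb <- tibble(idx = c(1, 2, 3, 4, 5), values1 = rep("a", 5))
second.tb <- tibble(idx = c(1, 2, 3, 4, 6), values2 = rep("b", 5))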
## Joining, by = "idx"
## # A tibble: 6 x 3
## idx values1 values2
## * <dbl> <chr> <chr>
## 1 1 a b
## 2 2 a b
## 3 3 a b
## 4 4 a b
## 5 5 a <NA>
## 6 6 <NA> b
## Joining, by = "idx"
## # A tibble: 6 x 3
## idx values2 values1
## * <dbl> <chr> <chr>
## 1 1 b a
## 2 2 b a
## 3 3 b a
## 4 4 b a
## 5 6 b <NA>
## 6 5 <NA> a
Left and right joins retain rows not matched from only one of the two data
sources, x and y, respectively.
## Joining, by = "idx"
## # A tibble: 5 x 3
## idx values1 values2
## * <dbl> <chr> <chr>
## 1 1 a b
## 2 2 a b
## 3 3 a b
## 4 4 a b
## 5 5 a <NA>
## Joining, by = "idx"
## # A tibble: 5 x 3
## idx values2 values1
## * <dbl> <chr> <chr>
## 1 1 b a
## 2 2 b a
## 3 3 b a
## 4 4 b a
## 5 6 b <NA>
## Joining, by = "idx"
## # A tibble: 5 x 3
## idx values1 values2
## * <dbl> <chr> <chr>
## 1 1 a b
## 2 2 a b
## 3 3 a b
## 4 4 a b
## 5 6 <NA> b
## Joining, by = "idx"
## # A tibble: 5 x 3
## idx values2 values1
## * <dbl> <chr> <chr>
## 1 1 b a
## 2 2 b a
## 3 3 b a
## 4 4 b a
## 5 5 <NA> a
An inner join discards all rows in x that do not have a matching row in y and
vice versa.
## Joining, by = "idx"
## # A tibble: 4 x 3
## idx values1 values2
## * <dbl> <chr> <chr>
## 1 1 a b
## 2 2 a b
## 3 3 a b
## 4 4 a b
## Joining, by = "idx"
## # A tibble: 4 x 3
## idx values2 values1
## * <dbl> <chr> <chr>
## 1 1 b a
## 2 2 b a
## 3 3 b a
## 4 4 b a
Next we apply the filtering join functions exported by 'dplyr': semi_join() and anti_join(). These functions return a tibble containing only the columns from x, retaining rows based on their match to rows in y.
A semi join retains rows from x that have a match in y.
semi_join(x = first.tb, y = second.tb)
## Joining, by = "idx"
## # A tibble: 4 x 2
## idx values1
## <dbl> <chr>
## 1 1 a
## 2 2 a
## 3 3 a
## 4 4 a
## Joining, by = "idx"
## # A tibble: 4 x 2
## idx values2
## <dbl> <chr>
## 1 1 b
## 2 2 b
## 3 3 b
## 4 4 b
## Joining, by = "idx"
## # A tibble: 1 x 2
## idx values1
## <dbl> <chr>
## 1 5 a
## Joining, by = "idx"
## # A tibble: 1 x 2
## idx values2
## <dbl> <chr>
## 1 6 b
Grammar of graphics
install.packages(learnrbook::pkgs_ch_ggplot)
To run the examples included in this chapter, you need first to load some pack-
ages from the library (see section 5.2 on page 163 for details on the use of pack-
ages).
library(learnrbook)
library(wrapr)
library(scales)
library(ggplot2)
library(ggrepel)
library(gginnards)
library(ggpmisc)
library(ggbeeswarm)
library(ggforce)
library(tikzDevice)
library(lubridate)
library(tidyverse)
library(patchwork)
7.3.1 Data
The data to be plotted must be available as a data.frame or tibble, with data stored
so that each row represents a single observation event, and the columns are dif-
ferent values observed in that single event. In other words, in long form (so-called
“tidy data”) as described in chapter 6. The variables to be plotted can be numeric,
factor, character, and time or date stored as POSIXct.
7.3.2 Mapping
When we design a plot, we need to map data variables to aesthetics (or graphic
properties). Most plots will have an 𝑥 dimension, which is considered an aesthetic,
and a variable containing numbers mapped to it. The position on a 2D plot of, say, a
point, will be determined by 𝑥 and 𝑦 aesthetics, while in a 3D plot, three aesthetics
need to be mapped 𝑥, 𝑦 and 𝑧. Many aesthetics are not related to coordinates,
they are properties, like color, size, shape, line type, or even rotation angle, which
add an additional dimension on which to represent the values of variables and/or
constants.
7.3.3 Geometries
Geometries are “words” that describe the graphics representation of the data:
for example, geom_point(), plots a point or symbol for each observation, while
geom_line(), draws line segments between observations. Some geometries rely by
default on statistics, but most “geoms” default to the identity statistics. Each time
a geometry is used to add a graphical representation of data to a plot, we say that
a new layer has been added. The name layer reflects the fact that each new layer
added is plotted on top of the layers already present in the plot, or rather when a
plot is printed the layers will be generated in the order they were added to the gg-
plot object. For example, one layer in a plot can display the observations, another
layer a regression line fitted to them, and a third one may contain annotations such as an equation or a text label.
7.3.4 Statistics
Statistics are “words” that represent calculation of summaries or some other oper-
ation on the values from the data. When statistics are used for a computation, the
returned value is passed directly to a geometry, and consequently adding a statistic also adds a layer to the plot. For example, stat_smooth() fits a smoother, and
stat_summary() applies a summary function. Statistics are applied automatically
by group when data have been grouped by mapping additional aesthetics such as
color to a factor.
7.3.5 Scales
Scales give the “translation” or mapping between data values and the aesthetic
values to be actually plotted. Mapping a variable to the “color” aesthetic (also
recognized when spelled as “colour”) only tells that different values stored in
the mapped variable will be represented by different colors. A scale, such as
scale_color_continuous(), will determine which color in the plot corresponds to
which value in the variable. Scales can also define transformations on the data,
which are used when mapping data values to aesthetic values. All continuous scales
support transformations—e.g., in the case of 𝑥 and 𝑦 aesthetics, positions on the
plotting region or viewport will be affected by the transformation, while the origi-
nal values will be used for tick labels along the axes. Scales are used for all aesthet-
ics, including continuous variables, such as numbers, and categorical ones such as
factors. The grammar of graphics allows only one scale per aesthetic and plot. This
restriction is imposed by design to avoid ambiguity (e.g., it ensures that the red
color will have the same “meaning” in all plot layers where the color aesthetic is
mapped to data). Scales have limits, and observations falling outside these limits are ignored (replaced by NA) rather than passed to statistics or geometries. It is easy to unintentionally drop observations when setting scale limits manually; the warning messages reporting that NA values have been omitted are the only alert.
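A minimal sketch of this behavior (our own example, not from the original text):

ggplot(data = mtcars, aes(x = disp, y = mpg)) +
  geom_point() +
  scale_y_continuous(limits = c(15, 25))
# observations with mpg outside 15..25 are dropped, with a warning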
7.3.7 Themes
How the plots look when displayed or printed can be altered by means of themes.
A plot can be saved without adding a theme and then printed or displayed using
different themes. Also, individual theme elements can be changed, and whole new
themes defined. This adds a lot of flexibility and helps in the separation of the data
representation aspects from those related to the graphical design.
We can start by creating an empty plot, to which we will add components one at a time.

ggplot()
The plot above is of little use without any data, so we next pass a data frame
object, in this case mtcars—mtcars is a data set included in R; to learn more about
this data set, type help("mtcars") at the R command prompt.
ggplot(data = mtcars)
Once the data are available, we need to map the quantities in the data onto
graphical features in the plot, or aesthetics. When plotting in two dimensions, we
need to map variables in the data to at least the 𝑥 and 𝑦 aesthetics. This map-
ping can be seen in the chunk below by its effect on the plotting area ranges that
now match the ranges of the mapped variables, extended by a small margin. The
axis labels also reflect the names of the mapped variables, however, there is no
graphical element yet displayed for the individual observations.
ggplot(data = mtcars,
aes(x = disp, y = mpg))
ggplot(data = mtcars,
aes(x = disp, y = mpg)) +
geom_point()
We can also save the ggplot object to a variable and print it explicitly; the assignment below is a sketch (it was lost in the source).

p <- ggplot(data = mtcars,
            aes(x = disp, y = mpg)) +
  geom_point()
print(p)
U Above we have seen how to build a plot, layer by layer, using the
grammar of graphics. We have also seen how to save a ggplot. We can peep
into the innards of this object using summary().
summary(p)
Although aesthetics can be mapped to variables in the data, they can also be set
to constant values, but only within layers, not as whole-plot defaults.
ggplot(data = mtcars,
aes(x = disp, y = mpg)) +
geom_point(color = "red", shape = "square")
We haven’t yet added some of the elements of the grammar described above:
scales, coordinates and themes. The plots were rendered anyway because these
elements have defaults which are used when we do not set them explicitly. We
next will see examples in which they are explicitly set. We start with a scale using
a logarithmic transformation. This works like plotting by hand using graph paper
with rulings spaced according to a logarithmic scale. Tick marks continue to be
expressed in the original units, but statistics are applied to the transformed data.
In other words, a transformed scale affects the values before they are passed to
statistics, and the linear regression will be fitted to log10() transformed 𝑦 values
and the original 𝑥 values.
ggplot(data = mtcars,
aes(x = disp, y = mpg)) +
geom_point() +
stat_smooth(geom = "line", method = "lm", formula = y ~ x) +
scale_y_log10()
ggplot(data = mtcars,
aes(x = disp, y = mpg)) +
geom_point() +
stat_smooth(geom = "line", method = "lm", formula = y ~ x) +
coord_cartesian(ylim = c(15, 25))
The next example uses a coordinate system transformation. When the trans-
formation is applied to the coordinate system, it affects only the plotting—it sits
between the geom and the rendering of the plot. The transformation is applied to
the values returned by any statistics. The straight line fitted is plotted on the trans-
formed coordinates as a curve, because the model was fitted to the untransformed
data and this fitted model is automatically used to obtain the predicted values,
which are then plotted after the transformation is applied to them. We have here
described only Cartesian coordinate systems while other coordinate systems are
described in sections 7.4.6 and 7.9 on pages 228 and 272, respectively.
ggplot(data = mtcars,
aes(x = disp, y = mpg)) +
geom_point() +
stat_smooth(geom = "line", method = "lm", formula = y ~ x) +
coord_trans(y = "log10")
ggplot(data = mtcars,
aes(x = disp, y = mpg)) +
geom_point() +
theme_classic()
We can also override the base font size and font family. This affects the size of
all text elements, as their size is defined relative to the base size. Here we add the
same theme as used in the previous example, but with a different base point size
for text.
ggplot(data = mtcars,
aes(x = disp, y = mpg)) +
geom_point() +
theme_classic(base_size = 20, base_family = "serif")
The details of how to set axis labels, tick positions and tick labels will be dis-
cussed in depth in section 7.7. Meanwhile, we will use function labs() which is a
convenience function allowing us to easily set the title and subtitle of a plot and to
replace the default name of scales used for axis labels—by default name is set to the
name of the mapped variable. When setting the name of scales with labs(), we use
as parameter names the names of aesthetics and pass as an argument a character
string, or an R expression. Here we use x and y, the names of the two aesthetics to
which we have mapped two variables in data, disp and mpg.
ggplot(data = mtcars,
aes(x = disp, y = mpg)) +
geom_point() +
labs(x = "Engine displacement (cubic inches)",
y = "Fuel use efficiency\n(miles per gallon)",
title = "Motor Trend Car Road Tests",
subtitle = "Source: 1974 Motor Trend US magazine")
Aesthetics can also be mapped to expressions computed on the fly from variables in data; the chunk below is a sketch, reconstructed from the figure that followed it in the original, whose x axis was labeled disp/cyl.

ggplot(data = mtcars,
       aes(x = disp / cyl, y = mpg)) +
  geom_point()
Each of the elements of the grammar exemplified above has several different
members, and many of the individual geometries and statistics accept arguments
that can be used to modify their behavior. There are also more aesthetics than
those shown above. Multiple data objects as well as multiple mappings can coexist
within a single ggplot object. Packages and user code can define new geometries,
statistics, coordinates and even implement new aesthetics. Individual elements in a
theme can also be modified and new complete themes created, re-used and shared.
We will describe in the remaining sections of this chapter how to use the grammar
of graphics to construct other types of graphical presentations including more
complex plots than those in the examples above.
We can look at the structure of the saved ggplot object p at its topmost level of nesting.

str(p, max.level = 1)
When we used in the previous section operator + to assemble the plots, we were
operating on “anonymous” R objects. In the same way, we can operate on saved or
“named” objects.
p +
stat_smooth(geom = "line", method = "lm", formula = y ~ x)
= In the examples above we have been adding elements one by one, using
the + operator. It is also possible to add multiple components in a single op-
eration using a list. This is useful, when we want to save sets of components
in a variable so as to reuse them in multiple plots. This saves typing, ensures
consistency and can make alterations to a set of similar plots much easier.
The definition of my.layers below is a sketch (the original was lost); the list bundles a fitted-line layer and a change to the x-axis breaks.

my.layers <- list(stat_smooth(geom = "line", method = "lm", formula = y ~ x),
                  scale_x_continuous(breaks = c(100, 300, 500)))
p + my.layers
ggplot(data = mtcars,
mapping = aes(x = disp, y = mpg)) +
geom_point()
ggplot() +
geom_point(data = mtcars,
mapping = aes(x = disp, y = mpg))
The default mapping can also be added directly with the + operator, instead of
being passed as an argument to ggplot().
ggplot(data = mtcars) +
aes(x = disp, y = mpg) +
geom_point()
It is even possible to have a default mapping for the whole plot, but no default
data.
ggplot() +
aes(x = disp, y = mpg) +
geom_point(data = mtcars)
In these examples, the plot remains unchanged, but this flexibility in the gram-
mar allows, in plots containing multiple layers, for each layer to use different data
or a different mapping.
ggplot(data = mtcars,
mapping = aes(x = disp, y = mpg)) +
geom_point(size = 4) +
geom_point(data = function(x){subset(x, cyl == 4)}, color = "yellow",
size = 1.5)
The plot default data can also be operated upon using the 'magrittr' pipe operator, but not the dot-pipe operator from 'wrapr' (see section 6.5 on page 187).
ggplot(data = mtcars,
       mapping = aes(x = disp, y = mpg)) +
  geom_point(size = 4) +
  geom_point(data = . %>% subset(cyl == 4), color = "yellow",
             size = 1.5)
7.4 Geometries
Different geometries support different aesthetics. While geom_point() supports
shape, and geom_line() supports linetype, both support x, y, color and size. In
this section we will describe the different geometries available in package ‘ggplot2’
and some examples from packages that extend ‘ggplot2’. The graphic output from
most code examples will not be shown, with the expectation that readers will run
them to see the plots.
Mainly for historical reasons, geometries accept a statistic as an argument, in the
same way as statistics accept a geometry as an argument. In this section we will only
describe geometries which have as a default statistic stat_identity which passes
values directly as mapped. The geometries that have other statistics as default are
described in section 7.5.2 together with the corresponding statistics.
7.4.1 Point
As shown earlier in this chapter, geom_point(), can be used to add a layer with ob-
servations represented by “points” or symbols. Variable cyl describes the numbers
of cylinders in the engines of the cars. It is a numeric variable, and when mapped
to color, a continuous color scale is used to represent this variable.
The first examples build scatter plots, because numeric variables are mapped to
both x and y. Some scales, like those for color, exist in two “flavors,” one suitable
for numeric variables (continuous) and another for factors (discrete).
ggplot(data = mtcars,
aes(x = disp, y = mpg, color = cyl)) +
geom_point()
If we convert cyl into a factor, a discrete color scale is used instead of a con-
tinuous one.
ggplot(data = mtcars,
aes(x = disp, y = mpg, color = factor(cyl))) +
geom_point()
If we convert cyl into an ordered factor, a different discrete color scale is used
by default.
ggplot(data = mtcars,
aes(x = disp, y = mpg, color = ordered(cyl))) +
geom_point()
The mapping between data values and aesthetic values is controlled by scales.
Different color scales, and even palettes within a given scale, provide different
mappings between data values and rendered colours.
ggplot(data = mtcars,
aes(x = disp, y = mpg, color = factor(cyl))) +
geom_point() +
scale_color_brewer(type = "qual", palette = 2)
The data, aesthetics mappings, and geometries are the same as in earlier code;
to alter how the plot looks, we have changed only the scale and palette used for
the color aesthetic. Conceptually it is still exactly the same plot we created earlier,
except for the colours used. This is a very important point to understand, because
it allows us to separate two different concerns: the semantic structure and the
graphic design.
U Try the different palettes available through the brewer scale. You can play
directly with the palettes using function brewer_pal() from package ‘scales’
together with show_col()).
show_col(brewer_pal()(3))
Once you have found a suitable palette for these data, redo the plot above
with the chosen palette.
When not relying on colors, the most common way of distinguishing groups
of observations in scatter plots is to use the shape of the points as an aesthetic.
We need to change a single “word” in the code statement to achieve this different
mapping.
ggplot(data = mtcars,
       aes(x = disp, y = mpg, shape = factor(cyl))) +
  geom_point()   # reconstructed: cyl is now mapped to shape instead of color
= One variable in the data can be mapped to more than one aesthetic, allow-
ing redundant aesthetics. This may seem wasteful, but it is extremely useful
as it allows one to produce figures that, even when produced in color, can still
be read if reproduced as black-and-white images.
Dot plots are similar to scatter plots but a factor is mapped to either the x or
y aesthetic. Dot plots are prone to have overlapping observations, and one way
of making these points visible is to make them partly transparent by setting a
constant value smaller than one for the alpha aesthetic.
A sketch of such a dot plot (the chunk was lost in the source):

ggplot(data = mtcars,
       aes(x = factor(cyl), y = mpg)) +
  geom_point(alpha = 1/3)
In the chunk below (a sketch, reconstructed from the legends of the figure it produced), wt is mapped to size and factor(cyl) to color, adding two further dimensions of information to the scatter plot.

ggplot(data = mtcars,
       aes(x = disp, y = mpg, size = wt, color = factor(cyl))) +
  scale_size() +
  geom_point()
Make the plot, look at it carefully. Check the numerical values of some of
the weights, and assess if your perception of the plot matches the numbers
behind it.
U Play with the code in the chunk above. Remove or change each of the map-
pings and the scale, display the new plot, and compare it to the one above.
Continue playing with the code until you are sure you understand what graph-
ical element in the plot is added or modified by each individual argument or
“word” in the code statement.
7.4.2 Rug
Rug plots are rarely used by themselves; instead, they are usually an addition to scatter plots. An example of the use of geom_rug() follows. Rugs make it easier to
see the distribution of observations along the 𝑥- and 𝑦-axes.
ggplot(data = mtcars,
aes(x = disp, y = mpg, color = factor(cyl))) +
geom_point() +
geom_rug()
Rug plots are most useful when the local density of observations is not too
high, otherwise rugs become too cluttered and the “rug threads” may overlap.
When overlap is moderate, making the segments semitransparent by setting
the alpha aesthetic to a constant value smaller than one, can make the varia-
tion in density easier to appreciate. When the number of observations is large,
marginal density plots should be preferred.
ggplot(data = Orange,
aes(x = age, y = circumference, linetype = Tree)) +
geom_line()
Straight segments can be drawn with geom_segment(), and curved ones with geom_curve(), whose curvature and angles can be controlled through additional aesthetics. These two geometries sup-
ments are geom_path(), which is similar to geom_line(), but instead of joining ob-
servations according to the values mapped to x, it joins them according to their
row-order in data, and geom_spoke(), which is similar to geom_segment() but using
a polar parametrization, based on x, y for origin, and angle and radius for the
segment. Finally, geom_step() plots only vertical and horizontal lines to join the
observations, creating a stepped line.
ggplot(data = Orange,
aes(x = age, y = circumference, linetype = Tree)) +
geom_step()
U Using the following toy data, make three plots using geom_line(), geom_path(), and geom_step() to add a layer.
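A minimal toy data set (our own values; the original ones were lost in the source):

toy.df <- data.frame(x = c(1, 3, 2, 4),
                     y = c(0, 1, 0.5, 1.5))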
ggplot(data = Orange,
aes(x = age, y = circumference, fill = Tree)) +
geom_area(position = "stack")
Finally, three geometries for drawing lines across the whole plotting area:
geom_hline, geom_vline and geom_abline. The first two draw horizontal and ver-
tical lines, respectively, while the third one draws straight lines according to the
aesthetics slope and intercept determining the position. The lines drawn with
these three geoms extend to the edge of the plotting area.
geom_hline and geom_vline require a single aesthetic, yintercept and
xintercept, respectively. Unlike for other geoms, the data for these aesthetics can also be passed as constant numeric vectors. The reason for this is that these ge-
oms are most frequently used to annotate plots rather than plotting observations.
Let’s assume that we want to highlight an event at the age of 1000 days.
ggplot(data = Orange,
aes(x = age, y = circumference, fill = Tree)) +
geom_area(position = "stack") +
geom_vline(xintercept = 1000, color = "gray75") +
geom_vline(xintercept = 1000, linetype = "dotted")
U Change the order of the three layers in the example above. How did the
figure change? What order is best? Would the same order be the best for a
scatter plot? And would it be necessary to add two geom_vline() layers?
7.4.4 Column
The geometry geom_col() can be used to create column plots where each bar rep-
resents an observation or case in the data.
R users not familiar yet with ‘ggplot2’ are frequently surprised by the
default behavior of geom_bar() as it uses stat_count() to produce a histogram,
rather than plotting values as is (see section 7.5.4 on page 245). geom_col() is
identical to geom_bar() but with "identity" as the default statistic.
We create artificial data that we will reuse in multiple variations of the next
figure.
set.seed(654321)
my.col.data <- data.frame(treatment = factor(rep(c("A", "B", "C"), 2)),
group = factor(rep(c("male", "female"), c(3, 3))),
measurement = rnorm(6) + c(5.5, 5, 7))
First we plot data for females only, using defaults for all aesthetics except 𝑥
and 𝑦 which we explicitly map to variables.
ggplot(subset(my.col.data, group == "female"),
       aes(x = treatment, y = measurement)) +
  geom_col()   # chunk reconstructed from the description above
To plot both groups, we map group to the fill aesthetic; by default the bars are stacked (this bridging example is a sketch; the original chunk was lost).

ggplot(my.col.data,
       aes(x = treatment, y = measurement, fill = group)) +
  geom_col()
We next use a formal style, and in addition, put the bars side by side by
setting position = "dodge" to override the default position = "stack". Setting
color = NA removes the lines bordering the bars.
ggplot(my.col.data,
       aes(x = treatment, y = measurement, fill = group)) +
  geom_col(color = NA, position = "dodge")   # reconstructed to match the description above
U Change the argument to position, or let the default be active, until you
understand its effect on the figure. What is the difference between positions
"identity", "dodge" and "stack"?
7.4.5 Tiles
We can draw square or rectangular tiles with geom_tile() producing tile plots or
simple heat maps.
We here generate 100 random draws from the 𝐹 distribution with degrees of
freedom 𝜈1 = 5, 𝜈2 = 20.
set.seed(1234)
randomf.df <- data.frame(F.value = rf(100, df1 = 5, df2 = 20),
x = rep(letters[1:10], 10),
y = LETTERS[rep(1:10, rep(10, 10))])
geom_tile() requires aesthetics 𝑥 and 𝑦, with no defaults, and width and height
with defaults that make all tiles of equal size filling the plotting area.
ggplot(randomf.df, aes(x, y, fill = F.value)) +
geom_tile()
We can set color = "gray75" and size = 1 to make the tile borders more visible
as in the example below, or use a contrasting color, to better delineate the borders
of the tiles. What to use will depend on whether the individual tiles add meaningful
information. In cases where rows of tiles correspond to individual genes and
columns to discrete treatments, the use of contrasting tile borders is preferable. In
contrast, in the case when the tiles are an approximation to a continuous surface
such as measurements on a regular spatial grid, it is best to suppress the tile
borders.
ggplot(randomf.df, aes(x, y, fill = F.value)) +
geom_tile(color = "gray75", size = 1.33)
U Play with the arguments passed to parameters color and size in the ex-
ample above, considering what features of the data are most clearly perceived
in each of the plots you create.
Any continuous fill scale can be used to control the appearance. Here we show
a tile plot using a gray gradient, with missing values in red.
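A sketch of such a plot (the scale parameters are our assumption):

ggplot(randomf.df, aes(x, y, fill = F.value)) +
  geom_tile(color = "gray75", size = 1.33) +
  scale_fill_gradient(low = "gray10", high = "gray90", na.value = "red")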
7.4.7 Text
We can use geom_text() or geom_label() to add text labels to observations. For
geom_text() and geom_label(), the aesthetic label provides the text to be plotted
and the usual aesthetics x and y, the location of the labels. As one would expect,
the color and size aesthetics can also be used for the text.
The chunk below is a sketch, completed from the surviving fragment and the figure it produced; the engine cylinder counts are plotted as text labels on top of semitransparent points.

ggplot(data = mtcars,
       aes(x = disp, y = mpg, color = factor(cyl), size = wt, label = cyl)) +
  scale_size() +
  geom_point(alpha = 1/3) +
  geom_text(color = "darkblue", size = 3)
In addition, angle and vjust and hjust can be used to rotate the text and adjust
its position. The default value of 0.5 for both hjust and vjust sets the center of the
text at the supplied x and y coordinates. “Vertical” and “horizontal” for justifica-
tion refer to the text, not the plot. This is important when angle is different from
zero. Values larger than 0.5 shift the label left or down, and values smaller than
0.5, right or up with respect to its x and y coordinates. A value of 1 or 0 sets the text
so that its edge is at the supplied coordinate. Values outside the range 0 … 1 shift
the text even farther away, however, still using units based on the length or height
of the text label. Recent versions of ‘ggplot2’ make possible justification using
character constants for alignment: "left", "middle", "right", "bottom", "center"
and "top", and two special alignments, "inward" and "outward", that automatically
vary based on the position in the plotting area.
In the case of geom_label() the text is enclosed in a box, which obeys the fill
aesthetic and takes additional parameters (described starting at page 231) allowing
control of the shape and size of the box. However, geom_label() does not support
rotation with the angle aesthetic.
You should be aware that R and ‘ggplot2’ support the use of UNICODE,
such as UTF8 character encodings in strings. If your editor or IDE supports
their use, then you can type Greek letters and simple maths symbols directly,
and they may show correctly in labels if a suitable font is loaded and an ex-
tended encoding like UTF8 is in use by the operating system. Even if UTF8 is
in use, text is not fully portable unless the same font is available, as even if
the character positions are standardized for many languages, most UNICODE
fonts support at most a small number of languages. In principle one can use
this mechanism to have labels both using other alphabets and languages like
Chinese with their numerous symbols mixed in the same figure. Furthermore,
the support for fonts and consequently character sets in R is output-device de-
pendent. The font encoding used by R by default depends on the default locale
settings of the operating system, which can also lead to garbage printed to the console or wrong characters being plotted when running the same code on a different computer from the one where a script was created. Not all is lost, though, as
R can be coerced to use system fonts and Google fonts with functions provided
by packages ‘showtext’ and ‘extrafont’. Encoding-related problems, especially
in MS-Windows, are common.
my.data <-
  data.frame(x = 1:5,
             y = rep(2, 5),
             label = c("a", "b", "c", "d", "e"))   # label values assumed; lost in the source

ggplot(my.data, aes(x, y, label = label)) +
  geom_text(size = 6)
In the next example we select a different font family, using the same characters in the Roman alphabet. The names "sans" (the default), "serif" and "mono" are recognized by all graphics devices on all operating systems. Additional fonts are available for specific graphic devices, such as the 35 "PDF" fonts supported by the pdf() device; their names can be queried with names(pdfFonts()).

ggplot(my.data, aes(x, y, label = label)) +
  geom_text(family = "serif", size = 6)   # a sketch; the original chunk was lost
U In the examples above the character strings were all of the same length,
containing a single character. Redo the plots above with longer character
strings of various lengths mapped to the label aesthetic. Do also play with
justification of these labels.
my.data <-
data.frame(x = 1:5, y = rep(2, 5), label = paste("alpha[", 1:5, "]", sep = ""))
my.data$label
## [1] "alpha[1]" "alpha[2]" "alpha[3]" "alpha[4]" "alpha[5]"
Text and labels do not automatically expand the plotting area past their anchoring coordinates, so we use expand_limits() to ensure that the text is not clipped at the edge of the plotting area. The chunk below is a sketch consistent with the figure in the original, which showed the parsed labels α₁ … α₅.

ggplot(my.data, aes(x, y, label = label)) +
  geom_text(parse = TRUE, size = 6) +
  expand_limits(x = 5.2)
In the example above, we mapped to label the text to be parsed. It is also pos-
sible, and usually preferable, to build suitable labels on the fly within aes() when
setting the mapping for label. Here we use geom_text() with strings to be parsed
into expressions created on the fly within the call to aes(). The same approach can
be used for regular character strings not requiring parsing.
my.data <-
data.frame(x = 1:5, y = rep(2, 5),
label = c("one", "two", "three", "four", "five"))
ggplot(my.data, aes(x, y, label = paste("italic(", label, ")", sep = ""))) +
  geom_text(parse = TRUE)   # a sketch: parsed labels built on the fly within aes()
U Play with the arguments to the different parameters and with the aesthet-
ics to get an idea of what can be done with them. For example, use thicker
border lines and increase the padding so that a visually well-balanced margin
is retained. You may also try mapping the fill and color aesthetics to factors
in the data.
ggplot(data = mtcars,
aes(x = disp, y = mpg, color = factor(cyl), size = wt, label = cyl)) +
scale_size() +
geom_point(alpha = 1/3) +
geom_text_repel(color = "black", size = 3,
min.segment.length = 0.2, point.padding = 0.1)
[Figure: mpg vs. disp from mtcars, with point color mapped to factor(cyl), size to wt, and repelled text labels showing the number of cylinders.]
The plotting of tables by mapping a list of data frames to the label aesthetic is done with geom_table(). Positioning, justification, and angle work as for geom_text() and are applied to the whole table. Only tibble objects (see the documentation of package ‘tibble’) can contain lists of data frames as variables, so this geometry requires the use of tibble objects to store the data. The table(s) are created as ‘grid’ grob objects, collected in a tree and added to the ggplot object as a new layer.
We first generate a tibble containing summaries from the data, formatted as
character strings, wrap this tibble in a list, and store this list as a column in an-
other tibble. To accomplish this, we use functions from the ‘tidyverse’ described
in chapter 6.
mtcars %.>%
group_by(., cyl) %.>%
summarize(.,
"mean wt" = format(mean(wt), digits = 2),
"mean disp" = format(mean(disp), digits = 0),
"mean mpg" = format(mean(mpg), digits = 0)) -> my.table
table.tb <- tibble(x = 500, y = 35, table.inset = list(my.table))
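A sketch of the matching plotting chunk; the mapping is an assumption based on the figure below:

ggplot(data = mtcars, aes(x = disp, y = mpg, color = factor(cyl))) +
  geom_point() +
  geom_table(data = table.tb,
             aes(x = x, y = y, label = table.inset),
             inherit.aes = FALSE)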
[Figure: mpg vs. disp colored by factor(cyl), with the table of group means inset near the top right corner.]
The color and size aesthetics control the text in the table(s) as a whole. It is also possible to rotate the table(s) using angle. As with text labels, justification is interpreted in relation to table-text orientation. We set y = 0 in data.tb and then use vjust = 1 to position the top of the table at this coordinate value. Parsed text, using R's plotmath syntax, is supported in the table, with fallback to plain text in case of parsing errors, on a cell-by-cell basis. We end this section with a simple example which, even if not very useful, demonstrates that geom_table() behaves like a "normal" ggplot geometry and that a table can be the only layer in a ggplot if desired. The addition of multiple tables with a single call to geom_table(), by passing a tibble with multiple rows as an argument for data, is also possible.
Package ‘ggpmisc’ also defines geom_plot(), which adds whole ggplots as insets. Insets are useful for zooming-in on parts of a main plot where observations are crowded, and for displaying summaries based on the observations shown in the main plot. The inset plots are nested in viewports which control the dimensions of the inset plot, and aesthetics vp.height and vp.width control their sizes, with defaults of 1/3 of the height and width of the plotting area of the main plot. Themes can be applied separately to the main and inset plots.
In the first example of inset plots, we include one of the summaries shown above, this time as a plot instead of a table. We first create a tibble containing the plot to be inset.
mtcars %.>%
group_by(., cyl) %.>%
summarize(., mean.mpg = mean(mpg)) %.>%
ggplot(data = .,
aes(factor(cyl), mean.mpg, fill = factor(cyl))) +
scale_fill_discrete(guide = FALSE) +
scale_y_continuous(name = NULL) +
geom_col() +
theme_bw(8) -> my.plot
plot.tb <- tibble(x = 500, y = 35, plot.inset = list(my.plot))
ggplot(data = mtcars, # opening lines of the chunk reconstructed
       aes(x = disp, y = mpg, color = factor(cyl))) +
  geom_point() +
  geom_plot(data = plot.tb,
            aes(x, y, label = plot.inset),
            vp.width = 1/2,
            inherit.aes = FALSE)
[Figure: mpg vs. disp colored by factor(cyl), with a column plot of the mean mpg per cylinder class inset at the top right.]
In the second example we add a zoomed-in version of the same plot as an inset. We 1) manually set limits to the coordinates to zoom into a region of the main plot, 2) set the theme of the inset, 3) remove the axis labels as they are the same as in the main plot, 4) add the inset to the main plot, and 5) highlight the zoomed-in region in the main plot. This fairly complex example shows how a new extension to ‘ggplot2’ can integrate well into the grammar of graphics paradigm. In this example, to show an alternative approach, instead of collecting all the data into a data frame, we map constant values directly to the various aesthetics within annotate() (see section 7.8 on page 269).
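A minimal sketch of such a chunk, assuming packages ‘ggpmisc’ and ‘tibble’ are loaded; the zoom region and inset position are assumptions consistent with the figure below:

p.zoom <- ggplot(mtcars, aes(disp, mpg, color = factor(cyl))) +
  geom_point() +
  coord_cartesian(xlim = c(270, 330), ylim = c(14, 19)) + # zoom region assumed
  labs(x = NULL, y = NULL) +
  theme_bw(8) +
  theme(legend.position = "none")
zoom.tb <- tibble(x = 150, y = 33, plot = list(p.zoom))
ggplot(mtcars, aes(disp, mpg, color = factor(cyl))) +
  geom_point() +
  geom_plot(data = zoom.tb, aes(x, y, label = plot), inherit.aes = FALSE) +
  annotate(geom = "rect", xmin = 270, xmax = 330, ymin = 14, ymax = 19,
           color = "black", fill = NA, linetype = "dotted")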
linetype = "dotted")
[Figure: mpg vs. disp with a zoomed-in inset covering approximately disp 270 to 330 and mpg 14 to 19, the region highlighted with a dotted rectangle in the main plot.]
Bitmaps and other ‘grid’ grobs can be added as insets in the same way with geom_grob(). In the next example we read two bitmaps with package ‘magick’ and convert them into raster grobs.
file1.name <-
system.file("extdata", "Isoquercitin.png", package = "ggpmisc", mustWork = TRUE)
Isoquercitin <- magick::image_read(file1.name)
file2.name <-
system.file("extdata", "Robinin.png", package = "ggpmisc", mustWork = TRUE)
Robinin <- magick::image_read(file2.name)
grob.tb <- tibble(x = c(0, 100), y = c(10, 20), height = 1/3, width = c(1/2),
grobs = list(grid::rasterGrob(image = Isoquercitin),
grid::rasterGrob(image = Robinin)))
ggplot() +
geom_grob(data = grob.tb,
aes(x = x, y = y, label = grobs, vp.height = height, vp.width = width),
hjust = "inward", vjust = "inward")
[Figure: the two bitmaps added as insets near opposite corners of the plotting area.]
Grid graphics provide the low-level functions that both ‘ggplot2’ and ‘lattice’ use under the hood. Grid supports different types of units for expressing the coordinates of positions within the plotting area. All examples outside this text box use "native" data coordinates; however, coordinates can also be given in physical units like "mm". More useful when working with scalable plots are "npc" normalized parent coordinates, which are expressed as numbers in the range 0 to 1, relative to the dimensions of the sides of the current viewport, with origin at the lower left corner.
Package ‘ggplot2’ interprets x and y coordinates in "native" data coordinates, and trickery seems to be needed to get around this limitation. A rather general solution is provided by package ‘ggpmisc’ through aesthetics npcx and npcy and the geometries that support them: at the time of writing, geom_text_npc(), geom_label_npc(), geom_table_npc(), geom_plot_npc() and geom_grob_npc(). These geometries are useful for annotating plots and adding insets at positions relative to the plotting area, positions that remain consistent across different plots, or across panels when using facets with free axis limits. Being geometries, they give full freedom in the elements added to different panels and in their positions.
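A minimal sketch of using one of these geometries; the data and npc positions are assumptions consistent with the figure below:

ggplot(mtcars, aes(disp, mpg, color = factor(cyl))) +
  geom_point() +
  geom_text_npc(data = data.frame(x = 0.5, y = 0.95, label = "a label"),
                aes(npcx = x, npcy = y, label = label),
                inherit.aes = FALSE)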
[Figure: mpg vs. disp with the text "a label" positioned near the top of the plotting area using npc coordinates.]
7.5 Statistics
Before learning about ‘ggplot2’ statistics, it is important to have a clear idea of how the mapping of factors to aesthetics works. When a factor, for example, is mapped to color, it creates a new grouping, with the observations matching a given level of the factor forming a group. Most statistics operate on the data for each of these groups separately, returning a summary for each group, for example, the mean of the observations in a group.
7.5.1 Functions
In addition to plotting data from a data frame with variables to map to the x and y aesthetics, it is possible to map a variable only to x and use stat_function() to compute the values to be mapped to y using an R function. This avoids the need to generate data beforehand, as even the number of data points to be generated can be set in stat_function(). Any R function, user defined or not, can be used as long as it is vectorized, with the length of the returned vector equal to the length of the vector passed as first argument to it. The variable mapped to x determines the range, and the argument to parameter n of stat_function() the length of the generated vector that is passed as first argument to fun when it is called to generate the values mapped to y. These are the x and y values passed to the geometry. We start with the Normal distribution function. We rely on the defaults n = 101 and geom = "path".
ggplot(data.frame(x = -3:3), aes(x = x)) +
stat_function(fun = dnorm)
[Figure: the Normal density curve drawn with stat_function() for x in [-3, 3].]
Using a list, we can also pass additional arguments by name, to be used when the function is called.
ggplot(data.frame(x = -3:3), aes(x = x)) +
stat_function(fun = dnorm, args = list(mean = 1, sd = .5))
U Edit the code above so as to plot in the same figure three curves, either for
three different values for mean or for three different values for sd.
7.5.2 Summaries
The summaries discussed in this section can be superimposed on raw data plots, or plotted on their own. Beware that if scale limits are manually set, the summaries will be calculated from the subset of observations within these limits. Scale limits can be altered when explicitly defining a scale or by means of functions xlim() and ylim(). See section 7.9 on page 272 for an explanation of how coordinate limits can be used to zoom into a plot without excluding x and y values from the data.
It is possible to summarize data on the fly when plotting. We describe in the same section the calculation of measures of central tendency and of variation, as stat_summary() allows them to be calculated simultaneously and added together within a single layer.
For use in the examples, we generate some normally distributed artificial data.
fake.data <- data.frame(
y = c(rnorm(10, mean = 2, sd = 0.5),
rnorm(10, mean = 4, sd = 0.7)),
group = factor(c(rep("A", 10), rep("B", 10)))
)
We will reuse a "base" scatter plot in a series of examples, so that the differences are easier to appreciate. We first add just the mean. In this case, we need to pass to stat_summary() as an argument the geom to use, as the default one, geom_pointrange(), expects data for plotting error bars in addition to the mean. This example uses a hyphen character as the constant value of shape (see the example for geom_point() on page 219 on the use of digits as shape). Instead of passing "mean" as an argument to parameter fun (earlier called fun.y), we can pass, if desired, other summary functions like "median". In the case of these functions that return a single computed value, we pass them, or character strings with their names, as an argument to parameter fun.
ggplot(data = fake.data, aes(y = y, x = group)) +
geom_point(shape = 21) +
stat_summary(fun = "mean", geom = "point",
color = "red", shape = "-", size = 10)
[Figure: observations of y for groups A and B, with the group means marked by red hyphens.]
To pass as an argument a function that returns a central value like the mean plus confidence or other limits, we use parameter fun.data instead of fun. In the next example we add means and confidence intervals for p = 0.95 (the default), assuming normality. We can override the default of p = 0.95 for confidence intervals by setting, for example, conf.int = 0.90 in the list of arguments passed to the function. The intervals can also be computed without assuming normality, using the empirical distribution estimated from the data by bootstrap. To achieve this, we pass to fun.data the argument "mean_cl_boot" instead of "mean_cl_normal".
stat_summary(fun.data = "mean_cl_boot",
stat_summary(fun.data = "mean_se",
geom = "errorbar",
20
hwy
10
0
2seater compact midsize minivan pickup subcompact suv
class
We can easily add error bars to the column plot. We use size to make
the lines of the error bars thicker. The default geometry in stat_summary() is
geom_pointrange(), so we can pass "linerange" as an argument for geom to elimi-
nate the point.
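A minimal sketch of such a chunk, assuming the same mpg data as in the figure above:

ggplot(data = mpg, aes(x = class, y = hwy)) +
  stat_summary(fun = "mean", geom = "col", fill = "grey80") +
  stat_summary(fun.data = "mean_se", geom = "linerange", size = 1)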
The “reverse” syntax is also valid, as we can add the geometry to the plot
object and pass the statistics as an argument to it. In general in this book we
avoid this alternative syntax for the sake of consistency.
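A minimal illustration of this alternative syntax, using the artificial data from above (an assumption, not a chunk shown earlier):

ggplot(data = fake.data, aes(y = y, x = group)) +
  geom_point(stat = "summary", fun = "mean")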
In most cases we will want to plot the observations as points together with the
smoother. We can plot the observation on top of the smoother, as done here, or
the smoother on top of the observations.
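A minimal sketch, assuming the mtcars data used elsewhere in this chapter:

ggplot(data = mtcars, aes(x = disp, y = mpg)) +
  stat_smooth() +
  geom_point()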
[Figure: mpg vs. disp with observations plotted on top of the default smoother.]
Instead of using the default spline, we can fit a different model. In this example we use a linear model as smoother, fitted by lm(). These data are really grouped, so we map variable cyl to the color aesthetic. We then get three groups of points with different colours and also three separate smooth lines, as in the sketch below.
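A minimal sketch of such a chunk (mtcars assumed):

ggplot(data = mtcars, aes(x = disp, y = mpg, color = factor(cyl))) +
  stat_smooth(method = "lm", formula = y ~ x) +
  geom_point()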
[Figure: mpg vs. disp colored by factor(cyl), with a separate linear-regression line for each group.]
To obtain a single smoother for the three groups, we need to set the mapping
of the color aesthetic to a constant within stat_smooth. This local value overrides
the default color mapping set in ggplot() just for this plot layer. We use "black"
but this could be replaced by any other color definition known to R.
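A minimal sketch of such a chunk (mtcars assumed):

ggplot(data = mtcars, aes(x = disp, y = mpg, color = factor(cyl))) +
  stat_smooth(method = "lm", formula = y ~ x, color = "black") +
  geom_point()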
[Figure: the same plot with a single black linear-regression line fitted to all observations.]
Smoothers based on models fitted with other functions, such as nls() for non-linear models, can be used similarly, passing se = FALSE as confidence bands are not computed for such fits. In the second example we define the model directly in the model formula, and provide the starting values explicitly. The names used for the parameters to be fitted can be chosen at will, within the restrictions of the R language, but of course the names used in formula and start must match each other. A minimal sketch, assuming an exponential-decay model (the model itself is an assumption):

ggplot(data = mtcars, aes(x = disp, y = mpg)) +
  stat_smooth(method = "nls",
              formula = y ~ a * exp(-x / tau), # assumed model
              method.args = list(start = list(a = 36, tau = 300)),
              se = FALSE) +
  geom_point()
[Figure: mpg vs. disp with a fitted polynomial and its equation, y = 20.1 - 28.4x + 9.15x^2, added as a text annotation.]
Package ‘ggpmisc’, used for the equation annotation in the figure above, also makes it possible to annotate plots with summary tables from a model fit.
color = "black",
Estimate = "estimate",
Statistics 245
"s.e." = "std.error",
"italic(t)" = "statistic",
"italic(P)" = "p.value"),
label.y.npc = "top", label.x.npc = "right",
parse = TRUE) +
geom_point()
[Figure: mpg vs. disp annotated with a summary table of the model fit (Parameter, Estimate, s.e., t, P).]
set.seed(12345)
my.data <-
data.frame(x = rnorm(200),
y = c(rnorm(100, -1, 1), rnorm(100, 1, 1)),
group = factor(rep(c("A", "B"), c(100, 100))) )
ggplot(my.data, aes(x)) +
geom_histogram(bins = 15)
[Figure: histogram of x with 15 bins.]
The computed values are contained in the data that the geometry “receives”
from the statistic. Many statistics compute additional values that are not mapped
by default. These can be mapped with aes() by enclosing them in a call to stat().
From the help page we can learn that in addition to counts in variable count, den-
sity is returned in variable density by this statistic. Consequently, we can create
a histogram with the counts per bin expressed as densities whose integral is one
(rather than their sum, as the width of the bins is in this case different from one),
as follows.
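A minimal sketch of such a chunk, assuming the grouped histogram in the figure below:

ggplot(my.data, aes(y, fill = group)) +
  geom_histogram(aes(y = stat(density)), bins = 15,
                 alpha = 0.5, position = "identity")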
[Figure: histograms of y for groups A and B with bin counts expressed as densities.]
If it were not for the easier-to-remember name of geom_histogram(), adding the layers with stat_bin() or stat_count() would be preferable, as it makes clear that computations on the data are involved.
[Figures: two-dimensional histograms of y vs. x, with the count of observations per bin mapped to fill.]
[Figure: empirical density curves of y for groups A and B.]
Examples of 2D density plots follow. In the first example we use two geome-
tries which were earlier described, geom_point() and geom_rug(), to plot the ob-
servations in the background. With stat_density_2d() we add a two-dimensional
density “map” represented using isolines. We map group to the color aesthetic.
ggplot(my.data, aes(x, y, color = group)) + # opening lines reconstructed
  geom_point() +
  geom_rug() +
  stat_density_2d()
[Figure: scatter plot with marginal rug and two-dimensional density isolines, colored by group.]
The matching geometry for this statistic is geom_density_2d().
In the next example we plot the groups in separate panels, and use a geometry supporting the fill aesthetic, mapping to it the variable level computed by stat_density_2d(), as sketched below.
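A minimal sketch of such a chunk:

ggplot(my.data, aes(x, y)) +
  stat_density_2d(aes(fill = stat(level)), geom = "polygon") +
  facet_wrap(facets = vars(group))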
[Figure: filled two-dimensional density contours of y vs. x in separate panels for groups A and B, with level mapped to fill.]
[Figure: box-and-whisker plots of y for groups A and B.]
As with other statistics, their appearance obeys both the usual aesthetics such as color, and parameters specific to this type of visual representation: outlier.color, outlier.fill, outlier.shape, outlier.size, outlier.stroke and outlier.alpha, which affect the outliers in a way similar to the equivalent aesthetics in geom_point(). The shape and width of the "box" can be adjusted with notch, notchwidth and varwidth. Notches in a boxplot serve a similar role for comparing medians as confidence limits serve when comparing means.
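A minimal sketch of such a chunk; the asterisk outlier shape is an assumption based on the figure below:

ggplot(my.data, aes(group, y)) +
  geom_boxplot(notch = TRUE, outlier.shape = "*", outlier.size = 6)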
[Figure: box-and-whisker plots of y for groups A and B, with outliers drawn as asterisks.]
[Figure: violin plots of y for groups A and B.]
As with other geometries, their appearance obeys both the usual aesthetics such
as color, and others specific to these types of visual representation.
Other types of displays related to violin plots are beeswarm plots and sina
plots, and can be produced with geometries defined in packages ‘ggbeeswarm’ and
‘ggforce’, respectively. A minimal example of a beeswarm plot is shown below. See
the documentation of the packages for details about the many options in their use.
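A minimal sketch, assuming package ‘ggbeeswarm’ is installed:

library(ggbeeswarm)
ggplot(my.data, aes(group, y)) +
  geom_beeswarm()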
[Figure: beeswarm plot of y for groups A and B.]
7.6 Facets
Facets are used in a special kind of plot containing multiple panels in which the panels share some properties. These sets of coordinated panels are a useful tool for visualizing complex data. Such plots became popular through the trellis graphs in S, and the ‘lattice’ package in R. The basic idea is to have rows and/or columns of plots with common scales, all plots showing values for the same response variable. This is useful when there are multiple classification factors in a data set. Similar-looking plots, but with free scales or with the same scale but a 'floating' intercept, are sometimes also useful. In ‘ggplot2’ there are two possible types of facets: facets organized in a grid, and facets along a single 'axis' of variation but, possibly, wrapped into two or more rows. These are produced by adding facet_grid() or facet_wrap(), respectively. In the examples below we use geom_point(), but faceting can be used with ggplot objects containing diverse kinds of layers, displaying either observations or summaries from data.
We start by creating and saving a single-panel plot that we will use throughout this section to demonstrate how the same plot changes when we add facets.
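A minimal sketch, consistent with the later uses of object p and the mtcars variables shown in the facet examples:

p <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point()
p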
[Figure: single-panel scatter plot of mpg vs. wt.]
A grid of panels has two dimensions, rows and cols. These dimensions in the
grid of plot panels can be “mapped” to factors. Until recently a formula syntax was
the only available one. Although this notation has been retained, the preferred syn-
tax is currently to use the parameters rows and cols. We use cols in this example.
Note that we need to use vars() to enclose the names of the variables in the data.
The “headings” of the panels or strip labels are by default the levels of the factors.
p + facet_grid(cols = vars(cyl))
[Figure: mpg vs. wt split into three column panels by cyl (4, 6, 8).]
In the “historical notation” the same plot would have been coded as follows.
p + facet_grid(. ~ cyl)
By default, all panels share the same scale limits and share the plotting space
evenly, but these defaults can be overridden.
Margins display an additional column or row of panels with the combined data.
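A minimal sketch (margins = TRUE is the parameter in facet_grid() that enables this):

p + facet_grid(cols = vars(cyl), margins = TRUE)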
[Figure: the same plot with an additional "(all)" margin panel containing the combined data.]
We can represent more than one variable per dimension of the grid of plot
panels. For this example, we also override the default labeller used for the panels
with one that includes the name of the variable in addition to factor levels in the
strip labels.
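A minimal sketch; the choice of the vs and am factors is an assumption:

p + facet_grid(cols = vars(vs, am), labeller = label_both)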
[Figure: mpg vs. wt faceted on two factors, with variable names included in the strip labels.]
More frequently we may need to include the levels of the factor used in the faceting as part of the labels. Here we use as labeller the function label_bquote() with a special syntax that allows us to use an expression in which replacement based on the facet (panel) data takes place. See section 7.12 for an example of the use of bquote(), the R function on which label_bquote() is built.
p +
facet_grid(cols = vars(cyl),
labeller = label_bquote(cols = .(cyl)~"cylinders"))
In the next example we create a plot with wrapped facets. In this case the num-
ber of levels is small, and no wrapping takes place by default. In cases when more
panels are present, wrapping into two or more continuation rows is the default.
Here, we force wrapping with nrow = 2. When using facet_wrap() there is only one
dimension, and the parameter is called facets, instead of rows or cols.
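A minimal sketch of such a chunk:

p + facet_wrap(facets = vars(cyl), nrow = 2)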
[Figure: mpg vs. wt with wrapped facets for cyl, forced into two rows.]
The example below (plot not shown) is similar to the earlier one for facet_grid(), but facets according to two factors with facet_wrap() along a single wrapped row of panels.
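A minimal sketch; the two factors are an assumption:

p + facet_wrap(facets = vars(vs, am), nrow = 1)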
7.7 Scales
In earlier sections of this chapter, examples have used the default scales or set them with convenience functions. In the present section we describe in more detail the use of scales. There are scales available for the different aesthetics (≈ attributes) of the plotted geometrical objects, such as position (x, y, z), size, shape, linetype, color, fill, alpha or transparency, and angle. Scales determine how values in data are mapped to values of an aesthetic, and how these values are labeled.
Depending on the characteristics of the data being mapped, scales can be continuous or discrete, for numeric or factor variables in data, respectively. On the other hand, some aesthetics, like size, can vary continuously, but others, like linetype, are inherently discrete. In addition to discrete scales for inherently discrete aesthetics, discrete scales are available for those aesthetics that are inherently continuous, like x, y, size, color, etc.
The scales used by default set the mapping automatically (e.g., which color value corresponds to x = 0 and which one to x = 1). However, for each aesthetic such as color, there are multiple scales to choose from when creating a plot, both continuous and discrete (e.g., 20 different color scales in ‘ggplot2’ 3.2.0).
The most direct mapping to data is identity, which means that the data are taken at face value. In a color scale, say scale_color_identity(), the variable in the data would be encoded with values such as "red" or "blue", i.e., valid R colours. In a simple mapping using scale_color_discrete(), levels of a factor, such as "treatment" and "control", would be represented as distinct colours, with the correspondence of individual factor levels to individual colours selected automatically by default. In contrast, with scale_color_manual() the user needs to explicitly provide the mapping between factor levels and colours by passing arguments to the scale functions' parameters breaks and values.
A continuous data variable needs to be mapped to an aesthetic through a continuous scale such as scale_color_continuous() or one of its various variants. Values in a numeric variable will be mapped into a continuous range of colours, determined either automatically through a palette or manually by giving the colours at the extremes, and optionally at multiple intermediate values, within the range of variation of the mapped variable (e.g., scale settings so that the color varies gradually between "red" and "gray50"). Handling of missing values is such that mapping a value in a variable to an NA value for an aesthetic such as color makes the mapped values invisible. The reverse, mapping NA values in the data to a specific value of an aesthetic, is also possible (e.g., displaying NA values in the mapped variable in red, while other values are mapped to shades of blue).
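A minimal sketch of an identity scale; the data frame is invented for illustration:

df1 <- data.frame(x = 1:10, y = rnorm(10),
                  colour.used = rep(c("red", "blue"), 5))
ggplot(df1, aes(x, y, color = colour.used)) +
  geom_point() +
  scale_color_identity()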
[Figure: circumference vs. time for the five trees in the Orange data set, with tree number mapped to color.]
Convenience functions xlab() and ylab() can be used to set the axis labels to match those in the previous chunk.

ggplot(data = Orange, # opening lines reconstructed
       aes(x = age, y = circumference, color = Tree)) +
  geom_line() +
  geom_point() +
  xlab("Time (d)") +
  ylab("Circumference (mm)")
Convenience function labs() is useful when we use default scales for all the
aesthetics in a plot but want to manually set axis labels and/or key titles—i.e., the
name of these scales. labs() accepts arguments for these names using, as parameter
names, the names of the aesthetics. It also allows us to set title, subtitle, caption
and tag, of which the first two can also be set with ggtitle().
ggplot(data = Orange,
aes(x = age, y = circumference, color = Tree)) +
geom_line() +
geom_point() +
expand_limits(y = 0) +
labs(title = "Growth of orange trees",
subtitle = "Starting from 1968-12-31",
caption = "see Draper, N. R. and Smith, H. (1998)",
tag = "A",
x = "Time (d)",
y = "Circumference (mm)",
color = "Tree\nnumber")
[Figure: the Orange plot with tag "A", title "Growth of orange trees", subtitle "Starting from 1968-12-31", the Draper and Smith caption, and the key titled "Tree number".]
fake2.data <-
data.frame(y = c(rnorm(20, mean = 20, sd = 5),
rnorm(20, mean = 40, sd = 10)),
group = factor(c(rep("A", 20), rep("B", 20))),
z = rnorm(40, mean = 12, sd = 6))
7.7.2.1 Limits
Limits are relevant to all kinds of scales. Limits are set through parameter limits of the different scale functions. They can also be set with convenience functions xlim() and ylim() in the case of the x and y aesthetics, and more generally with function lims(), which, like labs(), takes arguments named according to the names of the aesthetics. The limits argument of scales accepts vectors, factors or a function computing them from data. In contrast, the convenience functions do not accept functions as their arguments.
In the next example we set "hard" limits, which will exclude some observations from the plot and from any computation of summaries or fitting of smoothers.
More exactly, the off-limits observations are converted to NA values before they are
passed as data to geometries.
ggplot(fake2.data, aes(z, y)) + geom_point() +
scale_y_continuous(limits = c(0, 100))
To set only one limit leaving the other free, we can use NA as a boundary.
scale_y_continuous(limits = c(50, NA))
Convenience functions ylim() and xlim() can be used to set the limits to the
default 𝑥 and 𝑦 scales in use. We here use ylim(), but xlim() is identical except
for the scale it affects.
ylim(50, NA)
[Figure: scatter plot of y vs. z.]
The expand parameter of the scales plays a different role than expand_limits(). It controls how much larger the "visual" plotting area is compared to the limits of the actual plotting area. In other words, it adds a "margin" or padding to the plotting area outside the limits set either dynamically or manually. Plots are rarely drawn so that observations sit on top of the axes, and avoiding this is a key role of expand. Rug plots and marginal annotations will also require the plotting area to be expanded. In ‘ggplot2’ the default is to always apply some expansion.
Here we set the upper limit of the plotting area to be expanded by adding padding to the top, and remove the default padding from the bottom of the plotting area.
ggplot(fake2.data,
aes(fill = group, color = group, x = y)) +
stat_density(alpha = 0.3) +
scale_y_continuous(expand = expand_scale(add = c(0, 0.02)))
Here we instead use a multiplier to a similar effect as above; we add 10% of the range of the limits, as sketched below.
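A minimal sketch, reusing the chunk above with mult instead of add:

ggplot(fake2.data,
       aes(fill = group, color = group, x = y)) +
  stat_density(alpha = 0.3) +
  scale_y_continuous(expand = expand_scale(mult = c(0, 0.1)))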
In the case of scales, we cannot reverse their direction through the setting of
limits. We need instead to use a transformation as described in section 7.7.2.3 on
page 261. But, inconsistently, xlim() and ylim() do implicitly allow this transfor-
mation through the numeric values passed as limits.
U Test what the result is when the first limit is larger than the second one.
Is it the same as when setting these same values as limits with ylim()?
We can set tick labels manually, in parallel to the setting of breaks, by passing as arguments two vectors of equal length. In the next example we use an expression to obtain a Greek letter.
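A minimal sketch, consistent with the figure below (the exact breaks are an assumption):

ggplot(fake2.data, aes(z, y)) +
  geom_point() +
  scale_y_continuous(breaks = c(20, 10 * pi, 40, 60),
                     labels = expression(20, 10*pi, 40, 60))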
[Figure: y vs. z with the y-axis tick at 10π labeled with an expression.]
Package ‘scales’ provides several functions for the automatic generation of la-
bels. For example, to display tick labels as percentages for data available as decimal
fractions, we can use function scales::percent().
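A minimal sketch, consistent with the figure below:

ggplot(fake2.data, aes(z, y / max(y))) +
  geom_point() +
  scale_y_continuous(labels = scales::percent)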
[Figure: y/max(y) vs. z with y-axis tick labels formatted as percentages.]
For currency, we can use scales::dollar(); to include commas separating thousands, millions, and so on, we can use scales::comma(); and for numbers formatted using exponents of 10, useful for logarithm-transformed scales, we can use scales::scientific_format(). It is also possible to use user-defined functions both for breaks and labels.
[Figure: y vs. z with the direction of the x axis reversed by means of a transformation.]
Axis tick-labels display the original values before applying the transformation.
The "breaks" need to be given in the original scale as well. We use scale_y_log10()
to apply a log10 transformation to the 𝑦 values.
scale_y_log10(breaks=c(10,20,50,100))
scale_y_continuous(trans = "reciprocal")
Natural logarithms are important in growth analysis as the slope against time
gives the relative growth rate. We show this with the Orange data set.
ggplot(data = Orange,
aes(x = age, y = circumference, color = Tree)) +
geom_line() +
geom_point() +
scale_y_continuous(trans = "log", breaks = c(20, 50, 100, 200))
[Figures: mpg vs. wt with a secondary x axis, and with a secondary y axis showing the reciprocal 1/y.]
It is also possible to use different breaks and labels than for the main axes,
and to provide a different name to be used as a secondary axis label.
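A minimal sketch of a secondary axis, with breaks matching the figure above:

ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  scale_y_continuous(sec.axis = sec_axis(~ 1 / ., name = "1/y",
                                         breaks = c(0.03, 0.05, 0.07, 0.09)))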
Warnings are issued in the next two chunks as we are using scale limits
to subset a part of the observations present in data.
ggplot(data = weather_wk_25_2019.tb,
aes(with_tz(time, tzone = "EET"), air_temp_C)) +
geom_line() +
scale_x_datetime(name = NULL,
breaks = ymd_hm("2019-06-11 12:00", tz = "EET") + days(0:1),
limits = ymd_hm("2019-06-11 00:00", tz = "EET") + days(c(0, 2))) +
scale_y_continuous(name = "Air temperature (C)") +
expand_limits(y = 0)
[Figure: air temperature over two days, with datetime breaks at noon of each day.]
By default the tick labels produced and their formatting are automatically se-
lected based on the extent of the time data. For example, if we have all data col-
lected within a single day, then the tick labels will show hours and minutes. If we
plot data for several years, the labels will show the date portion of the time instant.
The default is frequently good enough, but it is possible, as for numbers, to use
different formatter functions to generate the tick labels.
ggplot(data = weather_wk_25_2019.tb,
aes(with_tz(time, tzone = "EET"), air_temp_C)) +
geom_line() +
scale_x_datetime(name = NULL,
date_breaks = "1 hour",
limits = ymd_hm("2019-06-16 00:00", tz = "EET") + hours(c(6, 18)),
date_labels = "%H:%M") +
scale_y_continuous(name = "Air temperature (C)") +
expand_limits(y = 0)
[Figure: air temperature from 06:00 to 18:00, with hourly breaks formatted as HH:MM.]
[Figure: column plot of hwy for three vehicle classes, with the tick labels changed to upper case.]
If, as in the previous example, only the case of character strings needs to be
changed, passing function toupper() or tolower() allows a more general and less
error-prone approach. In fact any function, user defined or not, which converts the
values of limits into the desired values can be passed as an argument to labels.
Alternatively, we can change the order of the columns in the plot by reordering
the levels of factor mpg$class. This approach makes sense if the ordering needs
to be done programmatically based on values in data. See section 2.12 on page 56
for details. The example below shows how to reorder the columns, corresponding
to the levels of class based on the mean() of hwy.
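A minimal sketch of such a chunk; the use of reorder() and stat_summary() is an assumption:

ggplot(data = mpg,
       aes(x = reorder(factor(class), hwy, FUN = mean), y = hwy)) +
  stat_summary(fun = "mean", geom = "col") +
  labs(x = "class")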
7.7.5 Size
For the size aesthetic, several scales are available, both discrete and continuous. They do not differ much from those already described above. Geometries geom_point(), geom_line(), geom_hline(), geom_vline(), geom_text() and geom_label() obey size as expected. In the case of geom_bar(), geom_col(), geom_area() and all other geometric elements bordered by lines, size is obeyed by these border lines. In fact, other aesthetics natural for lines, such as linetype, also apply to these borders.
When using size scales, breaks and labels affect the key or guide. In scales that produce a key, passing guide = FALSE removes the key corresponding to the scale.
There are separate but equivalent sets of scales available for the color and fill aesthetics. We will describe in more detail the color aesthetic and give only some examples for fill. We will, however, start by reviewing how colors are defined and used in R.
length(colors())
## [1] 657
col2rgb("purple")
## [,1]
## red 160
## green 32
## blue 240
col2rgb("#FF0000")
## [,1]
## red 255
## green 0
## blue 0
With function rgb() we can define new colors. Enter help(rgb) for more details.
rgb(1, 1, 0)
## [1] "#FFFF00"
As described above, colors can be defined in the RGB color space, however,
other color models such as HSV (hue, saturation, value) can be also used to define
colours.
hsv(c(0,0.25,0.5,0.75,1), 0.5, 0.5)
## [1] "#804040" "#608040" "#408080" "#604080" "#804040"
Probably more useful for use in scales than HSV colors are those returned by function hcl(), defined by hue, chroma and luminance. While the "value" and "saturation" in HSV are based on physical values, the "chroma" and "luminance" values in HCL are based on human visual perception. Colours with equal luminance will be seen as equally bright by an "average" human. In a scale based on different hues but equal chroma and luminance values, as used by package ‘ggplot2’, all colours are perceived as equally bright. The hues need to be expressed as angles in degrees, with values between zero and 360.
hcl(c(0,0.25,0.5,0.75,1) * 360)
## [1] "#FFC5D0" "#D4D8A7" "#99E2D8" "#D5D0FC" "#FFC5D0"
[Figure: points plotted with colors taken at face value through scale_color_identity().]
U How does the plot look, if the identity scale is deleted from the example
above? Edit and re-run the example code.
While using the identity scale, how would you need to change the code ex-
ample above, to produce a plot with green and purple points?
While layers added to a plot directly using geometries and statistics re-
spect faceting, annotation layers added with annotate() are replicated un-
changed in every panel of a faceted plot. The reason is that annotation layers
accept aesthetics only as constant values which are the same for every panel
as no grouping is possible without a mapping to data.
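A minimal sketch of an annotation with annotate(), consistent with the figure below:

ggplot(fake2.data, aes(z, y)) +
  geom_point() +
  annotate(geom = "text", x = 0, y = 0, label = "origin")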
[Figure: y vs. z annotated with the text "origin" near the origin of the axes.]
U Play with the values of the arguments to annotate() to vary the position,
size, color, font family, font face, rotation angle and justification of the anno-
tation.
It is relatively common to use inset tables, plots, bitmaps or vector graphics as annotations. With annotation_custom(), grobs (‘grid’ graphical objects) can be added to a ggplot. To add another plot, or the same plot, as an inset, we first need to convert it into a grob. In the case of a ggplot we use ggplotGrob(). In this example the inset is a zoomed-in window into the main plot. In addition to the grob, we need to provide the coordinates, expressed in "natural" data units of the main plot, for the location of the grob.
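A minimal sketch of such a chunk; the zoom region and inset coordinates are assumptions consistent with the figure below:

p.main <- ggplot(fake2.data, aes(z, y)) +
  geom_point()
p.inset <- ggplotGrob(p.main +
                        coord_cartesian(xlim = c(5, 10), ylim = c(20, 35)) +
                        theme_bw(8))
p.main +
  expand_limits(x = 40) +
  annotation_custom(p.inset, xmin = 20, xmax = 40, ymin = 40, ymax = 60)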
[Figure: y vs. z with a zoomed-in copy of part of the plot added as an inset grob.]
This approach has the limitation that if used together with faceting, the inset
will be the same for each plot panel. See section 7.4.8 on page 233 for geometries
that can be used to add insets.
In the next example, in addition to adding expressions as annotations, we also pass expressions as tick labels through the scale. Do notice that we use recycling when setting the breaks, as c(0, 0.5, 1, 1.5, 2) * pi is equivalent to c(0, 0.5 * pi, pi, 1.5 * pi, 2 * pi). Annotations are plotted at their own position, unrelated to any observation in the data, but using the same coordinates and units as for plotting the data.
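A minimal sketch consistent with the figure below (positions of the annotations are assumptions):

ggplot(data.frame(x = c(0, 2 * pi)), aes(x)) +
  stat_function(fun = sin) +
  scale_x_continuous(breaks = c(0, 0.5, 1, 1.5, 2) * pi,
                     labels = expression(0, 0.5*pi, pi, 1.5*pi, 2*pi)) +
  annotate(geom = "text", x = 0.5 * pi, y = 0.5, label = "+", size = 12) +
  annotate(geom = "text", x = 1.5 * pi, y = -0.5, label = "-", size = 12) +
  labs(y = "sin(x)")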
[Figure: sin(x) over 0 to 2π, with π-based tick labels and "+" and "-" annotations marking the positive and negative half-waves.]
U Modify the plot above to show the cosine instead of the sine function, re-
placing sin with cos. This is easy, but the catch is that you will need to relocate
the annotations.
[Figures: wind-rose plots in polar coordinates showing the frequency, and the empirical density, of wind directions.]
As the final wind-rose example, we draw a 2D density plot with facets added with facet_wrap() to have separate panels for AM and PM. This plot uses fill to describe the density of observations for different combinations of wind direction and speed, the radius (y aesthetic) to represent wind speed, and the angle (x aesthetic) to represent wind direction.
[Figure: faceted wind-rose density plots for AM and PM, with density level mapped to fill.]
Pie charts are more difficult to read than bar charts because our brain is better at comparing lengths than angles. If used at all, pie charts should only show composition, or fractional components that add up to a total, and only when the number of "pie slices" is small (rule of thumb: seven at most). In general, they are best avoided.
We use geom_bar(), which defaults to using stat_count(), and a brewer scale for nicer colors.
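A minimal sketch of such a chunk; the palette and use of theme_void() are assumptions:

ggplot(data = mpg, aes(x = factor(1), fill = factor(class))) +
  geom_bar(width = 1, color = "black") +
  coord_polar(theta = "y") +
  scale_fill_brewer(name = "Vehicle class", palette = "Set3") +
  theme_void()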
[Figure: pie chart of the number of observations per vehicle class in the mpg data set.]
U Edit the code for the pie chart above to obtain a bar chart. Which one of
the two plots is easier to read?
7.10 Themes
In ‘ggplot2’, themes are the equivalent of style sheets. They determine how the different elements of a plot are rendered when displayed, printed or saved to a file. Themes do not alter what aesthetics or scales are used to plot the observations or summaries, but instead how text labels, titles, axes, grids, plotting-area background, etc., are formatted and whether they are displayed or not. Package ‘ggplot2’ includes several predefined theme constructors (usually described as themes), and independently developed extension packages define additional ones. These constructors return complete themes which, when added to a plot, replace as a whole any theme already present. In addition to choosing among these already available complete themes, users can modify the themes already in use by adding incomplete themes to a plot. When used in this way, incomplete themes are usually created on the fly. It is also possible to create new theme constructors that return complete themes, similar to theme_gray() from ‘ggplot2’.
Even the default theme_gray() can be added to a plot, to modify it, if arguments
different to the defaults are passed when called. In this example we override the
default base size with a larger one and the default sans-serif font with one with
serifs.
ggplot(fake2.data, aes(z, y)) +
geom_point() +
theme_gray(base_size = 15,
base_family = "serif")
[Figure: the scatter plot rendered with theme_gray() at base size 15 and a serif font.]
U Change the code in the previous chunk to use, one at a time, each
of the predefined themes from ‘ggplot2’: theme_bw(), theme_classic(),
theme_minimal(), theme_linedraw(), theme_light(), theme_dark() and
theme_void().
The base size affects the size of all text elements in the theme, but not that of text in plot layers created with geometries, as their size is controlled by the size aesthetic.
A frequent idiom is to create a plot without specifying a theme, and then adding
the theme when printing or saving it. This can save work, for example, when pro-
ducing different versions of the same plot for a publication and a talk.
It is also possible to change the theme used by default in the current R session with theme_set(). It returns the previous default theme, which can be saved and later restored.

old_theme <- theme_set(theme_bw(15)) # first line reconstructed
p
theme_set(old_theme)
p
[Figure: scatter plot of y vs. z + 1000, whose long x-axis tick labels benefit from rotation.]
When tick labels are rotated, one usually needs to set both the horizontal
and vertical justification, hjust and vjust, as the default values stop being
suitable. This is due to the fact that justification settings are referenced to the
text itself rather than to the plot, i.e., vertical justification of 𝑥-axis tick labels
rotated 90 degrees shifts their alignment with respect to tick marks along the
(horizontal) 𝑥 axis.
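A minimal sketch of rotating x-axis tick labels, assuming the plot in the figure above:

ggplot(fake2.data, aes(z + 1000, y)) +
  geom_point() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))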
U Play with the code in the last chunk above, modifying the values used for
angle, hjust and vjust. (Angles are expressed in degrees, and justification with
values between 0 and 1).
A less elegant approach is to use a smaller font size. Within theme(), function
rel() can be used to set size relative to the base size. In this example, we use
axis.text.x so as to change the size of tick labels only for the 𝑥 axis.
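A minimal sketch of such a chunk:

ggplot(fake2.data, aes(z + 1000, y)) +
  geom_point() +
  theme(axis.text.x = element_text(size = rel(0.6)))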
U Modify the example above, so that the tick labels on the 𝑥-axis are blue
and those on the 𝑦-axis red, and the font size is the same for both axes, but
changed from the default. Consult the documentation for theme() to find out
the names of the elements that need to be given new values. For examples, see
ggplot2: Elegant Graphics for Data Analysis (Wickham and Sievert 2016) and R
Graphics Cookbook (Chang 2018).
Many other theme elements, such as the thickness of axes, the length of tick marks, and grid lines, can be adjusted in the same way. However, in most cases these are graphic design elements that are best kept consistent throughout sets of plots and are best handled by creating a new theme that can be easily reused.
If you both add a complete theme and want to modify some of its ele-
ments, you should add the whole theme before modifying it with + theme(...).
This may seem obvious once one has a good grasp of the grammar of graphics,
but can be at first disconcerting.
It is also possible to modify the default theme used for rendering all subsequent
plots.
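For example, theme_update() modifies elements of the current default theme, returning the previous one so that it can be restored later. A minimal sketch:

old_theme <- theme_update(text = element_text(color = "darkblue"))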
[Figure: wind-rose plot rendered with the modified default theme.]
my_theme_gray <-
function (base_size = 11,
base_family = "serif",
base_line_size = base_size/22,
base_rect_size = base_size/22,
base_color = "darkblue") {
theme_gray(base_size = base_size,
base_family = base_family,
base_line_size = base_line_size,
base_rect_size = base_rect_size) +
theme(line = element_line(color = base_color),
rect = element_rect(color = base_color),
text = element_text(color = base_color),
title = element_text(color = base_color),
axis.text = element_text(color = base_color), complete = TRUE)
}
In the chunk above we have created our own theme constructor, without too much effort, and using an approach that is very likely to continue working with future versions of ‘ggplot2’. The saved theme is a function with parameters and defaults for them. In this example we have kept the function parameters the same as those used in ‘ggplot2’, only adding an additional parameter after the existing ones, to maximize compatibility. To avoid surprising users, we may additionally want to make my_theme_grey() a synonym of my_theme_gray(), following ‘ggplot2’ practice.
Finally, we use the new theme constructor in the same way as those defined
in ‘ggplot2’.
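A minimal sketch of its use; the data are an assumption (the book applies it to the wind-rose plot in the figure below):

ggplot(fake2.data, aes(z, y)) +
  geom_point() +
  my_theme_gray(15, base_color = "darkblue")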
[Figure: wind-rose plot rendered with my_theme_gray().]
Next, we compose a plot using as panels the three plots created above (plot not
shown).
(p1 | p2) / p3
We add a title and tag the panels with a letter. In this, and similar cases, paren-
theses may be needed to alter the default precedence of the R operators.
((p1 | p2) / p3) +
plot_annotation(title = "Fuel use in city traffic:", tag_levels = 'a')
[Figure: composed plot with panels tagged a, b and c under the title "Fuel use in city traffic:"; two scatter plots of cty vs. displ side by side above a plot of cty by car model.]
7.12 Using expressions
[Figure: plot using expressions, with Greek-letter point labels, a y-axis label reading "Speed (m s-1)", and the subscripted symbol for alpha as the x-axis title.]
We can also use a character string stored in a variable, and use function parse()
to parse it in cases where an expression is required as we do here for subtitle.
In this example we also set tick labels to expressions, taking advantage that
expression() accepts multiple arguments separated by commas returning a vector
of expressions.
my_eq.char <- "alpha[i]"
ggplot(my.data, aes(x, y)) +
geom_point() +
labs(title = parse(text = my_eq.char)) +
scale_x_continuous(name = expression(alpha[i]),
breaks = c(1,3,5),
labels = expression(alpha[1], alpha[3], alpha[5]))
[Figure: the plot with the parsed title and x-axis tick labels showing subscripted Greek letters at x = 1, 3 and 5.]
A different approach (no example shown) would be to use parse() explicitly for
each individual label, something that might be needed if the tick labels need to be
“assembled” programmatically instead of set as constants.
In strings that are to be parsed, embedded quotation marks must be escaped (\" where a literal " is desired). We can, also in both cases, embed a character string by means of one of the functions plain(), italic(), bold() or bolditalic(), which also affect the font used. The argument to these functions needs to be a character string delimited by quotation marks if it is not to be parsed.
When using expression(), bare quotation marks can be embedded:

ggplot(cars, aes(speed, dist)) +
  geom_point() +
  xlab(expression(x[1]*" test"))

while with parse() the embedded text is passed through plotmath functions such as plain():

xlab(parse(text = "x[1]~~~~plain(test)"))
U Study the chunk above. If you are familiar with C or C++, function sprintf() will already be familiar to you; otherwise study its help page. Play with functions format(), sprintf(), and strftime(), formatting different types of data into character strings of different widths, with different numbers of digits, etc.
It is also possible to substitute the value of variables or, in fact, the result of evaluation, into a new expression, allowing on-the-fly construction of expressions. Such expressions are frequently used as labels in plots. This is achieved through the use of quoting and substitution. We use bquote() to substitute variables or expressions enclosed in .( ) by their value. Be aware that the argument to bquote() needs to be written as an expression; in this example we need to use a tilde, ~, to insert a space between words. Furthermore, if the expressions include variables, these will be searched for in the environment rather than in data, except within a call to aes().
[Figures: dist vs. speed from the cars data set, with titles built on the fly with bquote().]
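The calls below use a small test function; a minimal definition consistent with the output shown (an assumption) is:

deparse_test <- function(x) {
  deparse(substitute(x))
}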
deparse_test("constant")
## [1] "\"constant\""
deparse_test(1 + 2)
## [1] "1 + 2"
deparse_test(a)
## [1] "a"
= A new package, ‘ggtext’, which is not yet on CRAN, provides rich-text (basic HTML and Markdown) support for ‘ggplot2’, both for annotations and for data visualization. This package provides an alternative to the use of R expressions.
When building a complex plot, it is good practice to start with a bare-bones version and check that it is error-free. Afterwards, one can map additional aesthetics and add geometries and statistics gradually. The final steps are then to add annotations and the text or expressions used for titles, and axis and key labels. Another approach is to start with an existing plot and modify it, e.g., by using the same plotting code with different data or mapping different variables. When reusing code for a different data set, scale limits and names are likely to need to be edited.
U Build a graphically complex data plot of your interest, step by step. By step
by step, I do not refer to using the grammar in the construction of the plot as
earlier, but of taking advantage of this modularity to test intermediate versions
in an iterative design process, first by building up the complex plot in stages
as a tool in debugging, and later using iteration in the processes of improving
the graphic design of the plot and improving its readability and effectiveness.
We assemble the final plot from the two parts we saved into variables. This is
useful when we need to create several plots ensuring that scale name arguments are
used consistently. In the example above, we saved these names, but the approach
can be used for other plot components or lists of components.
myplot
myplot + mylabs + theme_bw(16)
myplot + mylabs + theme_bw(16) + ylim(0, NA)
U Revise the code you wrote for the “playground” exercise in section 7.13,
but this time, pre-building and saving groups of elements that you expect to
be useful unchanged when composing a different plot of the same type, or a
plot of a different type from the same data.
For Encapsulated Postscript and SVG output, we only need to substitute pdf()
with postscript() or svg(), respectively.
In the case of graphics devices for file output in BMP, JPEG, PNG and TIFF bitmap
formats, arguments passed to width and height are expressed in pixels.
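A minimal sketch, reusing the myplot object from above (the file name is an assumption):

png("my-plot.png", width = 800, height = 600) # dimensions in pixels
print(myplot)
dev.off()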
= Some graphics devices are part of base R, and others are implemented in contributed packages. In some cases, there are multiple graphics devices available for rendering graphics in a given file format. These devices usually use different libraries, or have been designed with different aims. These alternative graphics devices can also differ in their function signature, i.e., have differences in the parameters and their names. In cases when rendering fails inexplicably, it can be worthwhile to switch to an alternative graphics device to find out if the problem is in the plot or in the rendering engine.
Most programmers have seen them, and most good programmers realize they've written at least one. They are huge, messy, ugly programs that should have been short, clean, beautiful programs.

John Bentley
Programming Pearls, 1986
8.2 Introduction
The first step in any data analysis with R is to input or read in the data. Available sources of data are many, and data can be stored or transmitted using various formats, based on either text or binary encodings. It is crucial that data are not altered (corrupted) when read, and that when an error does occur, it is clearly reported. Most dangerous are silent, non-catastrophic errors.
The very welcome increase in awareness of the need for open availability of data makes the output of data from R into well-defined data-exchange formats another crucial step. Consequently, in many cases an important step in data analysis is to export the data for submission to a repository, in addition to publication of the results of the analysis.
Faster internet access to data sources and cheaper random-access memory (RAM) have made it possible to efficiently work with relatively large data sets in R. That R keeps all data in memory (RAM) imposes limits on the size of data R functions can operate on. For data sets large enough not to fit in computer RAM, one can use selective reading of data from flat files, or from databases outside of R.
Some R packages have made it faster to import data saved in the same formats
already supported by base R, but in some cases providing weaker guarantees of
not corrupting the data than base R. Other contributed packages make it possible
to import and export data stored in file formats not supported by base R functions.
Some of these formats are subject-area specific while others are in widespread use.
To run the examples included in this chapter, you need first to install some packages and load them from the library (see section 5.2 on page 163 for details on the use of packages).

install.packages(learnrbook::pkgs_ch_data)
library(learnrbook)
library(tibble)
library(purrr)
library(wrapr)
library(stringr)
library(dplyr)
library(tidyr)
library(readr)
library(readxl)
library(xlsx)
library(readODS)
library(pdftools)
library(foreign)
library(haven)
File names and operations 295
library(xml2)
library(XML)
library(ncdf4)
library(tidync)
library(lubridate)
library(jsonlite)
= Some data sets used in this and other chapters are available in package ‘learnrbook’. In addition to the R data objects, we provide files saved in foreign formats, which we use in examples on how to import data. The files can be read either from the R library or from a copy in a local folder. In this chapter we assume the user has copied the folder "extdata" from the package to a working folder.
Copy the files using:
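A minimal sketch of the copy operation:

pkg.path <- system.file("extdata", package = "learnrbook")
file.copy(pkg.path, ".", overwrite = TRUE, recursive = TRUE)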
We also make sure that the folder used to save data read from the internet exists.
save.path = "./data"
if (!dir.exists(save.path)) {
dir.create(save.path)
}
While in Unix and Linux folder nesting in file paths is marked with a forward slash character (/), under MS-Windows it is marked with a backslash character (\). Backslash (\) is an escape character in R, interpreted as the start of an embedded special sequence of characters (see section 2.7 on page 39), while a forward slash (/) can be used in R for file paths under any OS; an escaped backslash (\\) is valid only under MS-Windows. Consequently, / should always be preferred to \\ to ensure portability, and this is the approach used in this book.
basename("extdata/my-file.txt")
## [1] "my-file.txt"
basename("extdata\\my-file.txt")
## [1] "my-file.txt"
Functions getwd() and setwd() can be used to get the path to the current work-
ing directory and to set a directory as current, respectively.
# not run
getwd()
Function setwd() invisibly returns the path of the previously current working directory, allowing us to portably restore the working directory later. Both relative paths (relative to the current working directory), as in the example, and absolute paths (given in full) are accepted as an argument. In mainstream OSs "." indicates the current directory and ".." the directory above the current one.
# not run
oldwd <- setwd("..")
getwd()
The returned value is always an absolute full path, so it remains valid even if
the path to the working directory changes more than once before being restored.
# not run
oldwd
setwd(oldwd)
getwd()
We can also obtain lists of files and/or directories (= disk folders) portably
across OSs.
head(list.files())
## [1] "abbrev.sty" "anscombe.svg" "appendixes.prj"
## [4] "appendixes.prj.bak" "bits" "chapters-removed"
head(list.dirs())
## [1] "." "./.git" "./.git/hooks" "./.git/info"
## [5] "./.git/logs" "./.git/logs/refs"
head(dir())
## [1] "abbrev.sty" "anscombe.svg" "appendixes.prj"
## [4] "appendixes.prj.bak" "bits" "chapters-removed"
U The default argument for parameter path is the current working directory,
under Windows, Unix, and Linux indicated by ".". Convince yourself that this
is indeed the default by calling the functions with an explicit argument. After
this, play with the functions trying other existing and non-existent paths in
your computer.
U Compare the behavior of functions dir() and list.dirs(), and try by over-
riding the default arguments of list.dirs(), to get the call to return the same
output as dir() does by default.
Base R provides several functions for portably working with files, and they are
listed in the help page for files and in individual help pages. Use help("files")
to access the help for this “family” of functions.
if (!file.exists("xxx.txt")) {
file.create("xxx.txt")
}
## [1] TRUE
file.size("xxx.txt")
## [1] 0
file.info("xxx.txt")
## size isdir mode mtime ctime
## xxx.txt 0 FALSE 666 2020-04-24 02:52:45 2020-04-24 02:52:45
## atime exe
## xxx.txt 2020-04-24 02:52:45 no
file.rename("xxx.txt", "zzz.txt")
## [1] TRUE
file.exists("xxx.txt")
## [1] FALSE
file.exists("zzz.txt")
## [1] TRUE
file.remove("zzz.txt")
## [1] TRUE
U Function file.path() can be used to construct a file path from its compo-
nents in a way that is portable across OSs. Look at the help page and play with
the function to assemble some paths that exist in the computer you are using.
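The chunk opening the connection read below is not shown; a minimal sketch, with a hypothetical file name:

f1 <- file("extdata/example.csv", open = "r") # hypothetical file name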
readLines(f1, n = 2L)
## [1] "1.0,24.5,346,ABC" "23.4,45.6,78,Z Y"
close(f1)
When R is used in batch mode, the “files” stdin, stdout and stderror can be
opened, and data read from, or written to. These standard sources and sinks, so
familiar to C programmers, allow the use of R scripts as tools in data pipes coded
as shell scripts under Unix and other OSs.
Not all text files are born equal. When reading text files, and foreign binary files which may contain embedded text strings, there is potential for their misinterpretation during the import operation. One common source of problems is that column headers are to be read as R names. As earlier discussed, there are strict rules, such as avoiding spaces or special characters, if the names are to be used with the normal syntax. On import, some functions will attempt to sanitize the names, but others will not. Most such names are still accessible in R statements, but a special syntax is needed to protect them from triggering syntax errors through their interpretation as something different than variable or function names; in R jargon we say that they need to be quoted.
Some of the things we need to be on the watch for are: 1) Mismatches between the character encoding expected by the function used to read the file, and the encoding used for saving the file, usually because of different locales. 2) Leading or trailing (invisible) spaces present in the character values or column names, which are almost invisible when data frames are printed. 3) Wrongly guessed column classes: a typing mistake affecting a single value in a column, e.g., the wrong kind of decimal marker, prevents the whole column from being decoded as numeric. 4) Lookalike characters in names, such as the letter "l" and the digit "1", or the letter "O" and the digit "0"; the data frame below has four distinct, valid column names.
data.frame(al = 1, a1 = 2, aO = 3, a0 = 4)
## al a1 aO a0
## 1 1 2 3 4
Reading data from a text file can result in very odd-looking values stored
in R variables because of a mismatch in encoding, e.g., when a CSV file saved
with MS-Excel is silently encoded using 16-bit unicode format, but read as an
8-bit unicode encoded file.
The hardest part of all these problems is to diagnose their origin, as function arguments and working environment options can in most cases be used to force the correct decoding of text files with diverse characteristics, origins and vintages, once one knows what is required. One function in the R ‘tools’ package, which is not exported, can at the time of writing be used to test files for the presence of non-ASCII characters: tools:::showNonASCIIfile(). This function takes as an argument the path to a file.
col1,col2,col3,col4
1.0,24.5,346,ABC
23.4,45.6,78,Z Y
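We read the file with read.csv(); the file path used here is a hypothetical one:

from_csv_a.df <- read.csv("extdata/example.csv") # hypothetical file name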
sapply(from_csv_a.df, class)
## col1 col2 col3 col4
## "numeric" "numeric" "integer" "character"
from_csv_a.df[["col4"]]
## [1] "ABC" "Z Y"
levels(from_csv_a.df[["col4"]])
## NULL
Although space characters are read as part of the fields, they are ignored when
conversion to numeric takes place. The remaining leading and trailing spaces in
character strings are difficult to see when data frames are printed.
By extracting and printing the column we can see more clearly that the character values read from this file contain leading spaces.
sapply(from_csv_b.df, class)
## col1 col2 col3 col4
## "numeric" "numeric" "integer" "character"
from_csv_b.df[["col4"]]
## [1] " ABC" " Z Y"
levels(from_csv_b.df[["col4"]])
## NULL
By default, column names are sanitized but values are not. By consulting the documentation with help(read.csv) we discover that by passing an additional argument we can change this default and obtain the data read as desired. Most likely the default has been chosen so that data integrity is maintained.
from_csv_e.df[["col4"]]
## [1] "ABC" "Z Y"
levels(from_csv_e.df[["col4"]])
## NULL
Up to R 3.6.3, the functions from the R ‘utils’ package converted columns containing
character strings into factors by default. Since R 4.0.0 the default is
stringsAsFactors = FALSE, and character strings remain as is, as seen above; the
old behavior can be restored by passing stringsAsFactors = TRUE in the call.
sapply(from_csv_c.df, class)
## col1 col2 col3 col4
## "numeric" "numeric" "integer" "character"
from_csv_c.df[["col4"]]
## [1] "ABC" "Z Y"
Decimal points and exponential notation are allowed for floating point values.
In English-speaking locales, the decimal mark is a point, while in many other locales
it is a comma. If a comma is used as the decimal marker, it can no longer be used
as the field separator, and the separator is usually replaced by a semicolon (;). In
such cases we can use read.csv2() and write.csv2(). Furthermore, parameters
dec and sep allow setting these characters arbitrarily. Function read.table() does
the actual work, and functions like read.csv() differ only in the default arguments
for the different parameters. By default, read.table() expects fields to be separated
by white space (one or more spaces, tabs, new lines, or carriage returns). Strings
with embedded spaces need to be quoted in the file, as was done in the file read
next into from_txt_b.df.
sapply(from_txt_b.df, class)
## col1 col2 col3 col4
## "numeric" "numeric" "integer" "character"
from_txt_b.df[["col4"]]
## [1] "ABC" "Z Y"
levels(from_txt_b.df[["col4"]])
## NULL
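To make the relationship between read.csv2() and read.table() described above concrete, here is a minimal sketch (the file name is hypothetical):
# read.csv2() is read.table() with different defaults: these two calls
# are (nearly) equivalent for a well-formed file using ';' as field
# separator and ',' as decimal mark.
df1 <- read.csv2("example-DE.csv")
df2 <- read.table("example-DE.csv", header = TRUE, sep = ";", dec = ",")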
In fixed-width files there are no delimiter characters between fields; each field
is instead recognized by its position within the line, as in this example file:
10245346ABC
234456 78Z Y
sapply(from_fwf_a.df, class)
## col1 col2 col3 col4
## "numeric" "numeric" "numeric" "character"
from_fwf_a.df[["col4"]]
## [1] "ABC" "Z Y"
The file-reading functions described above share their parameters with
read.table(). In addition to those already described, other frequently use-
ful parameters are skip and nrows, which can be used to skip lines at the top of a
file and to limit the number of lines (or records) read; header, which accepts a
logical argument indicating whether the fields in the first text line read should be de-
coded as column names rather than data; na.strings, to which a character vector
with the strings to be interpreted as NA can be passed; and colClasses, which
provides control of the conversion of the fields to R classes and makes it possible
to skip some columns altogether. All these parameters are described in the cor-
responding help pages.
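A minimal sketch (the file name and column layout are hypothetical) combining several of these parameters:
# Skip one title line, read at most 100 records, decode "-999" as NA,
# and force the classes of the four columns.
df <- read.csv("example.csv",
               skip = 1,
               nrows = 100,
               na.strings = "-999",
               colClasses = c("numeric", "numeric", "integer", "character"))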
Next we give one example of the use of a write function matching one of the
read functions described above. The write.csv() function takes as an argument
a data frame, or an object that can be coerced into a data frame, converts it to
character strings, and saves them to a text file. We first create the data frame that
we will write to disk.
We write my.df to a CSV file suitable for an English language locale, and then
display its contents.
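The code for these two steps is not reproduced above; a minimal sketch that produces the file displayed below (the output file name is an assumption):
# Create a small data frame and save it as CSV; row.names = FALSE
# suppresses the extra column of row names.
my.df <- data.frame(x = 1:5, y = 5:1 / 10, z = letters[1:5])
write.csv(my.df, file = "my-file.csv", row.names = FALSE)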
"x","y","z"
1,0.5,"a"
2,0.4,"b"
3,0.3,"c"
4,0.2,"d"
5,0.1,"e"
U Write the data frame my.df into text files with functions write.csv2() and
write.table() instead of write.csv(), and display the files.
Function cat() takes R objects and, after conversion to character strings, writes
them to the console or to a file, inserting one or more characters as separator, by
default a space. This separator can be set through parameter sep. In our example
we set sep to a new line (entered as the escape sequence "\n").
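The call itself is not shown above; it was presumably similar to this sketch (the objects passed are assumptions), which produces the output below:
cat("abcd", "hello world", 123.45, sep = "\n")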
abcd
hello world
123.45
8.6.2 ‘readr’
Package ‘readr’ is part of the ‘tidyverse’ suite. It defines functions that allow faster
input and output, and that have different default behavior. In contrast to base R
functions, they are optimized for speed, but they may sometimes decode their input
wrongly, and on rare occasions even do so silently. Base R functions do less guessing,
e.g., the delimiters must be supplied as arguments. The ‘readr’ functions guess more
properties of the text-file format; in most cases they succeed, which is very handy, but
occasionally they fail. Automatic guessing can be overridden by passing arguments,
and this is recommended for scripts that may be reused to read different files in the
future. Another important advantage is that these functions read character strings
formatted as dates or times directly into columns of class POSIXct. All write func-
tions defined in ‘readr’ have an append parameter, which can be used to change the
default behavior of overwriting an existing file with the same name into appending
the output at its end.
Although in this section we exemplify the use of these functions by passing a
file name as an argument, as is the case with R native functions, URLs and open
file descriptors are also accepted (see section 8.5 on page 298). Furthermore, if the
file name ends in an extension recognizable as indicating a compressed file format,
the file will be uncompressed on the fly.
As we can see in this first example, these functions also report to the console
the specification of the columns, which is important when this is guessed from
the contents of the file, or a part of it.
library(readr)
read_csv(file = "extdata/aligned-ASCII-UK.csv")
read_csv(file = "extdata/not-aligned-ASCII-UK.csv")
read_table(file = "extdata/aligned-ASCII.txt")
## Parsed with column specification:
## cols(
## col1 = col_double(),
## col2 = col_double(),
## col3 = col_double(),
## col4 = col_character()
## )
## # A tibble: 2 x 4
## col1 col2 col3 col4
## <dbl> <dbl> <dbl> <chr>
## 1 1 24.5 346 "ABC"
## 2 23.4 45.6 78 "\"Z Y\""
read_table2(file = "extdata/not-aligned-ASCII.txt")
Function read_tsv() reads files encoded with the tab character as the delimiter,
and read_fwf() reads files with fixed-width fields. There is, however, no equivalent
to read.fortran(), which supports implicit decimal points.
U Use the “wrong” read_ functions to read the example files used above
and/or your own files. As mentioned earlier, forcing errors will help you learn
how to diagnose when such errors are caused by coding mistakes. In this case,
as wrongly read data are not always accompanied by error or warning mes-
sages, carefully check the returned tibbles for misread data values.
The functions from R’s ‘utils’ package read the whole file as text before attempting
to guess the class of the columns or their alignment. This is reliable but slow
for very large text files. The functions from ‘readr’ read only the top 1000 lines
by default for guessing, and then read the whole file assuming that the guessed
properties also apply to the remainder of the file. This is more efficient, but
somewhat risky. In earlier versions of ‘readr’, a typical failure to correctly decode
fields occurred when numbers were in increasing order and the field widths kept
increasing in the lines below those used for guessing; at the time of writing this
case seems to be handled correctly. It also means that when an individual value
beyond the first guess_max lines cannot be converted to numeric, instead of the
whole column being returned as character strings, as base R functions would do,
this value is encoded as NA with a warning. To demonstrate this we
will drastically reduce guess_max from its default so that we can use an example
file only a few lines in length.
read_table2(file = "extdata/miss-aligned-ASCII.txt")
The write_ functions from ‘readr’ are the counterpart to write. functions from
‘utils’. In addition to the expected write_csv(), write_csv2(), write_tsv() and
write_delim(), ‘readr’ provides functions that write MS-Excel-friendly CSV files.
We demonstrate here the use of write_excel_csv() to produce a text file with
comma-separated fields suitable for import into MS-Excel.
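A minimal sketch (the data frame and file name are assumptions):
# Write a CSV file that MS-Excel opens cleanly; write_excel_csv() adds
# a byte order mark so that Excel recognizes the UTF-8 encoding.
write_excel_csv(my.df, "my-file-excel.csv")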
The pair of functions read_lines() and write_lines() read and write charac-
ter vectors without conversion, similarly to base R readLines() and writeLines().
Functions read_file() and write_file() read and write the contents of a whole
text file into, and from, a single character string. Functions read_file() and
write_file() can also be used with raw vectors to read and write binary files or
text files of unknown encoding.
The contents of the whole file are returned as a character vector of length one,
with the new-line markers embedded. We use cat() to print it, so that these
new-line characters force the start of new lines in the printout.
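The call creating one.str is not shown above; a sketch (the path is hypothetical):
one.str <- read_file("example.txt")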
cat(one.str)
## col1 col2 col3 col4
## a 20 2500 abc
8.7.1 ‘xml2’
Package ‘xml2’ provides functions for reading and parsing XML and HTML files.
This is a vast subject, of which I will only give a brief example.
We first read a web page with function read_html(), and explore its structure.
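The code for this step is not reproduced above; a sketch, assuming the structure listing below was produced with xml_structure() (the URL is hypothetical; the title of the page actually read is shown further below):
library(xml2)
web_page <- read_html("https://www.example.com/")
xml_structure(web_page)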
## {text}
## <hr>
## <h1>
## {text}
## {text}
## <hr>
## <p>
## {text}
## <a [href]>
## {text}
## {text}
## {text}
## <p>
## {text}
## <a [href]>
## {text}
## {text}
## {text}
## <address>
## {text}
## {text}
Next we extract the text of its title element, using functions
xml_find_all() and xml_text().
xml_text(xml_find_all(web_page, ".//title"))
## [1] "Suite of R packages for photobiology"
The functions defined in this package can be used to “harvest” data from web
pages, but also to read data from files using formats that are defined through XML
schemas.
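The same extraction can be written as a pipe; a sketch (the original chunk is not reproduced above) using the dot-pipe operator from ‘wrapr’:
web_page %.>%
  xml_find_all(x = ., xpath = ".//title") %.>%
  xml_text(x = .)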
I have passed all arguments by name to make explicit how this pipe works. See
section 6.5 on page 187 for details on the use of the pipe and dot-pipe operators.
8.9 Worksheets
Microsoft Office, Open Office and Libre Office are the most frequently used suites
containing programs based on the worksheet paradigm. A standardized file format
for the exchange of worksheet data exists, but it does not support all the features
present in the native file formats. We will start by considering MS-Excel.
The file format used by MS-Excel has changed significantly over the years; old
formats tend to be less well supported by available R packages and may require the
file to be updated to a more modern format with MS-Excel itself before import into
R. The current format is based on XML and is relatively simple to decode, whereas
the older binary formats are more difficult. Worksheets frequently contain formulas
(equations) in addition to the actual data. In all cases, only the values, either
entered as such or computed by the embedded formulas, can be imported into R,
not the formulas themselves.
8.9.2 ‘readxl’
Package ‘readxl’ supports reading of MS-Excel workbooks, and selecting work-
sheets and regions within worksheets specified in ways similar to those used by
MS-Excel itself. The interface is simple, and the package is easy to install. We will
import a small workbook file created with MS-Excel.
We first list the sheets contained in the workbook file with excel_sheets().
In this case, the argument passed to sheet is redundant, as there is only a sin-
gle worksheet in the file. It is possible to use either the name of the sheet or a
positional index (in this case 1 would be equivalent to "my data"). We use function
read_excel() to import the worksheet. Being part of the ‘tidyverse’ the returned
value is a tibble and character columns are returned as is.
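A sketch of the elided calls (the file path is an assumption; the sheet name is the one mentioned above):
library(readxl)
excel_sheets("extdata/my-data.xlsx")
## [1] "my data"
my_data.tb <- read_excel("extdata/my-data.xlsx", sheet = "my data")
my_data.tb
Only part of the resulting printout is reproduced below.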
## 2 2 a
## 3 3 a
## # ... with 4 more rows
Of the remaining arguments, the most useful ones have the same names and
play similar roles as in ‘readr’ (see section 8.6.2 on page 305).
8.9.3 ‘xlsx’
Package ‘xlsx’ can be more difficult to install as it uses Java functions to do the
actual work. However, it is more comprehensive, with functions both for reading
and writing MS-Excel worksheets and workbooks, in different formats including
the older binary ones. Similarly to ‘readxl’, it allows selected regions of a worksheet
to be imported.
Here we use function read.xlsx(), indexing the worksheet by name. The re-
turned value is a data frame, and following the expectations of R package ‘utils’,
character columns are converted into factors by default.
With function write.xlsx() we can write data frames out to Excel worksheets
and even append new worksheets to an existing workbook.
set.seed(456321)
my.data <- data.frame(x = 1:10, y = letters[1:10])
write.xlsx(my.data, file = "extdata/my-data.xlsx", sheetName = "first copy")
write.xlsx(my.data, file = "extdata/my-data.xlsx", sheetName = "second copy",
           append = TRUE)
U If you have some worksheet files available, import them into R to get a feel
for how the organization of the data within the worksheets affects how easy
or difficult it is to import.
8.9.4 ‘readODS’
Package ‘readODS’ provides functions for reading data saved in files that follow
the Open Document Standard. Function read_ods() has a user interface similar
to, but simpler than, that of read_excel(); it reads one worksheet at a time, with
support only for skipping top rows. The value returned is a data frame.
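A sketch of the elided call (the path is hypothetical):
library(readODS)
ods.df <- read_ods("my-data.ods", sheet = 1)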
ods.df
## sample group observation
## 1 1 a 1.0
## 2 2 a 5.0
## 3 3 a 7.0
## 4 4 a 2.0
## 5 5 a 5.0
## 6 6 b 0.0
## 7 7 b 2.0
## 8 8 b 3.0
## 9 9 b 1.0
## 10 10 b 1.5
8.10.1 ‘foreign’
Functions in package ‘foreign’ allow us to import data from files saved by several
statistical analysis programs, including SAS, Stata, SPSS, Systat and Octave, among
others; the package also includes functions for writing data into files with formats
native to SAS, Stata, and SPSS. R documents the use of these functions in detail in
the R Data Import/Export manual. As a simple example, we use function read.spss()
to read a .sav file, saved a few years ago with the then-current version of SPSS. We
display only the first six rows and seven columns of the data frame, including a
column with dates, which appears as numeric.
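A sketch of the elided call (the path is hypothetical); passing to.data.frame = TRUE requests a data frame instead of a list:
library(foreign)
my_spss.df <- read.spss(file = "my-data.sav", to.data.frame = TRUE)
my_spss.df[1:6, c(1:6, 17)]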
A second example, this time with a simple .sav file saved 15 years ago.
Another example, this time a Systat file saved on a PC more than 20 years ago, and
read with read.systat().
my_systat.df <- read.systat(file = "extdata/BIRCH1.SYS")
head(my_systat.df)
## CONT DENS BLOCK SEEDL VITAL BASE ANGLE HEIGHT DIAM
## 1 1 1 1 2 44 2 0 1 53
## 2 1 1 1 2 41 2 1 2 70
## 3 1 1 1 2 21 2 0 1 65
## 4 1 1 1 2 15 3 0 1 79
## 5 1 1 1 2 37 3 0 1 71
## 6 1 1 1 2 29 2 1 1 43
Not all functions in ‘foreign’ return data frames by default, but in all cases the
returned value can be converted into, or requested as, a data frame (e.g., by passing
to.data.frame = TRUE to read.spss()).
8.10.2 ‘haven’
Package ‘haven’ is less ambitious with respect to the number of formats supported,
or their vintages, providing read and write functions for only three file formats:
those of SAS, Stata and SPSS. On the other hand, ‘haven’ provides flexible ways
to convert the different labeled values that cannot be directly mapped to R modes.
Its functions also decode dates and times according to the idiosyncrasies of each
of these file formats. When an imported file contains labeled values, the returned
tibble object needs some additional attention from the user. Labeled numeric
columns in SPSS are not necessarily equivalent to factors, although they sometimes
are. Consequently, conversion to factors cannot be automated and must be done
manually in a separate step.
We can use function read_sav() to import a .sav file saved by a recent version
of SPSS. As in the previous section, we display only the first six rows and seven
columns of the data frame, including a column treat containing a labeled numeric
vector and harvest_date with dates encoded as R date values.
my_spss.tb <- read_sav(file = "extdata/my-data.sav")
my_spss.tb[1:6, c(1:6, 17)]
## # A tibble: 6 x 7
## block treat mycotreat water1 pot harvest harvest_date
## <dbl> <dbl+lbl> <dbl> <dbl> <dbl> <dbl> <date>
## 1 0 1 [Watered, EM] 1 1 14 1 2015-06-15
## 2 0 1 [Watered, EM] 1 1 52 1 2015-06-15
## 3 0 1 [Watered, EM] 1 1 111 1 2015-06-15
## # ... with 3 more rows
U If you use or have in the past used other statistical software or a general-
purpose language like Python, look for some old files and import them into
R.
8.11.1 ‘ncdf4’
Package ‘ncdf4’ supports reading of files using netCDF version 4 or earlier formats.
Functions in ‘ncdf4’ not only allow reading and writing of these files, but also their
modification.
We first read metadata to obtain an index of the file contents, and in additional
steps, read a subset of the data. With print() we can find out the names and
characteristics of the variables and attributes. In this example, we read long-term
averages for potential evapotranspiration (PET).
We first open a connection to the file with function nc_open().
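A sketch of the elided calls; the path matches the file used with ‘tidync’ later in this chapter:
library(ncdf4)
meteo_data.nc <- nc_open("extdata/pevpr.sfc.mon.ltm.nc")
# a compact view of the returned connection object
str(meteo_data.nc, max.level = 1)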
U Increase max.level in the call to str() above and study how the connection
object stores information on the dimensions and on each data variable. You
can also use print(meteo_data.nc) for a more complete printout once you have
understood the structure of the object.
The dimensions of the array data are described with metadata, in our example
mapping indexes to a grid of latitudes and longitudes, and to a time vector as
a third dimension. The dates are returned as character strings. We fetch the
variables one at a time with function ncvar_get().
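A sketch of the elided calls; the variable names match their use in the code below:
longitude <- ncvar_get(meteo_data.nc, "lon")
latitude  <- ncvar_get(meteo_data.nc, "lat")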
The time vector is rather odd: as these data are long-term averages, it contains
only monthly values, expressed as days from 1800-01-01 and corresponding to the
first day of each month of year 1. We use package ‘lubridate’ for the conversion.
We construct a tibble object with PET values for one grid point, taking advan-
tage of the recycling of short vectors.
pet.tb <-
tibble(time = ncvar_get(meteo_data.nc, "time"),
month = month(ymd("1800-01-01") + days(time)),
lon = longitude[6],
lat = latitude[2],
pet = ncvar_get(meteo_data.nc, "pevpr")[6, 2, ]
)
pet.tb
## # A tibble: 12 x 5
## time month lon lat pet
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 -657073 12 9.38 86.7 4.28
## 2 -657042 1 9.38 86.7 5.72
## 3 -657014 2 9.38 86.7 4.38
## # ... with 9 more rows
If we want to read several grid points, we can use different approaches. However,
the order of nesting of the dimensions can make adding the dimensions as columns
error prone. It is much simpler to use package ‘tidync’, described next.
8.11.2 ‘tidync’
Package ‘tidync’ provides functions that make it easier to extract subsets of the
data from a NetCDF file. We start by doing the same operations as in the examples
for ‘ncdf4’.
We open the file creating an object and simultaneously activating the first grid.
meteo_data.tnc <- tidync("extdata/pevpr.sfc.mon.ltm.nc")
meteo_data.tnc
##
## Data Source (1): pevpr.sfc.mon.ltm.nc ...
##
## Grids (5) <dimension family> : <associated variables>
##
## [1] D0,D1,D2 : pevpr, valid_yr_count **ACTIVE GRID** ( 216576 values per variable)
## [2] D3,D2 : climatology_bounds
## [3] D0 : lon
## [4] D1 : lat
## [5] D2 : time
##
## Dimensions 4 (3 active):
##
## dim name length min max start count dmin dmax unlim coord_dim
## <chr> <chr> <dbl> <dbl> <dbl> <int> <int> <dbl> <dbl> <lgl> <lgl>
## 1 D0 lon 192 0. 3.58e2 1 192 0. 3.58e2 FALSE TRUE
## 2 D1 lat 94 -8.85e1 8.85e1 1 94 -8.85e1 8.85e1 FALSE TRUE
## 3 D2 time 12 -6.57e5 -6.57e5 1 12 -6.57e5 -6.57e5 FALSE TRUE
##
## Inactive dimensions:
##
## dim name length min max unlim coord_dim
## <chr> <chr> <dbl> <dbl> <dbl> <lgl> <lgl>
## 1 D3 nbnds 2 1 2 FALSE FALSE
hyper_dims(meteo_data.tnc)
## # A tibble: 3 x 7
## name length start count id unlim coord_dim
## * <chr> <dbl> <int> <int> <int> <lgl> <lgl>
## 1 lon 192 1 192 0 FALSE TRUE
## 2 lat 94 1 94 1 FALSE TRUE
## 3 time 12 1 12 2 FALSE TRUE
hyper_vars(meteo_data.tnc)
## # A tibble: 2 x 6
## id name type ndims natts dim_coord
## <int> <chr> <chr> <int> <int> <lgl>
## 1 4 pevpr NC_FLOAT 3 14 FALSE
## 2 5 valid_yr_count NC_FLOAT 3 4 FALSE
We extract a subset of the data into a tibble in long (or tidy) format, and add
the months using a pipe operator from ‘wrapr’ and methods from ‘dplyr’.
hyper_tibble(meteo_data.tnc,
lon = signif(lon, 1) == 9,
lat = signif(lat, 2) == 87) %.>%
mutate(.data = ., month = month(ymd("1800-01-01") + days(time))) %.>%
select(.data = ., -time)
## # A tibble: 12 x 5
## pevpr valid_yr_count lon lat month
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4.28 1.19e-39 9.38 86.7 12
## 2 5.72 1.19e-39 9.38 86.7 1
## 3 4.38 1.29e-39 9.38 86.7 2
## # ... with 9 more rows
In this second example, we extract data for all grid points along latitudes. To
achieve this we need only omit the test on lat from the chunk above. The tibble
is assembled automatically and columns for the active dimensions are added. The
decoding of the months remains unchanged.
hyper_tibble(meteo_data.tnc,
lon = signif(lon, 1) == 9) %.>%
mutate(.data = ., month = month(ymd("1800-01-01") + days(time))) %.>%
select(.data = ., -time)
## # A tibble: 1,128 x 5
## pevpr valid_yr_count lon lat month
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.02 1.19e-39 9.38 88.5 12
## 2 4.28 1.19e-39 9.38 86.7 12
## 3 3.03 9.18e-40 9.38 84.8 12
## # ... with 1,125 more rows
U Instead of extracting data for one longitude across latitudes, extract data
across longitudes for one latitude near the Equator.
8.12 Remotely located data
Functions that read files will in many cases also accept a URL in place of a local
file name, allowing data to be read directly from a remote location. In this example
we read a data logger's file with read.csv2().
logger.df <-
  read.csv2(file = "http://r4photobiology.info/learnr/logger_1.txt",
            header = FALSE,
            col.names = c("time", "temperature"))
sapply(logger.df, class)
## time temperature
## "character" "numeric"
sapply(logger.df, mode)
## time temperature
## "character" "numeric"
logger.tb <-
read_csv2(file = "http://r4photobiology.info/learnr/logger_1.txt",
col_names = c("time", "temperature"))
## Using ',' as decimal and '.' as grouping mark. Use read_delim() for more control.
## Parsed with column specification:
## cols(
## time = col_character(),
## temperature = col_double()
## )
sapply(logger.tb, class)
## time temperature
## "character" "numeric"
sapply(logger.tb, mode)
## time temperature
## "character" "numeric"
While the functions in package ‘readr’ support the use of URLs, those in packages
‘readxl’ and ‘xlsx’ do not. Consequently, we need to first download the file and save
a local copy, which we can then read as described in section 8.9.2 on page 312. Function
download.file() in the R ‘utils’ package can be used to download files using URLs. It
supports different modes, such as binary or text, and write or append, as well as
different methods, such as "internal", "wget" and "libcurl".
download.file("http://r4photobiology.info/learnr/my-data.xlsx",
"data/my-data-dwn.xlsx",
mode = "wb")
remote_my_spss.tb <-
read_sav(file = "http://r4photobiology.info/learnr/thiamin.sav")
remote_my_spss.tb
## # A tibble: 24 x 2
## THIAMIN CEREAL
## <dbl> <dbl+lbl>
## 1 5.2 1 [wheat]
## 2 4.5 1 [wheat]
## 3 6 1 [wheat]
## # ... with 21 more rows
In this example we use a downloaded NetCDF file of long-term means for poten-
tial evapotranspiration from NOAA, the same file used above in the ‘ncdf4’ example.
This is a moderately large file, at 444 KB. In this case we cannot directly open a
connection to the remote NetCDF file, so we first download it (the code is commented
out below, as we already have a local copy), and then open the local file.
my.url <- paste("ftp://ftp.cdc.noaa.gov/Datasets/ncep.reanalysis.derived/",
"surface_gauss/pevpr.sfc.mon.ltm.nc",
sep = "")
#download.file(my.url,
# mode = "wb",
# destfile = "extdata/pevpr.sfc.mon.ltm.nc")
pet_ltm.nc <- nc_open("extdata/pevpr.sfc.mon.ltm.nc")
8.13.1 ‘jsonlite’
We give here a simple example using a module from the YoctoPuce (http://www.
yoctopuce.com/) family, accessed through a software hub running locally. We retrieve
logged data from a YoctoMeteo module.
We use function fromJSON() from package ‘jsonlite’ to retrieve the data logged
by one sensor.
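The call is not reproduced above; a heavily hedged sketch (the URL, port and endpoint are all assumptions about a hub running on the local machine):
library(jsonlite)
sensor.lst <- fromJSON("http://localhost:4444/api.json")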
The minimum, mean, and maximum values for each logging interval need to be
split from a single vector. We do this by indexing with a (recycled) logical vector.
The data returned are in long form, with quantity names and units also returned by
the module, as well as the time. The chunk below shows only the tail of the pipe:
the object returned by fromJSON() enters at the dot (.) passed as argument to
parameter .data.
dplyr::transmute(.data = .,
utc.time = as.POSIXct(utc, origin = "1970-01-01", tz = "UTC"),
hr_min = unlist(val)[c(TRUE, FALSE, FALSE)],
hr_mean = unlist(val)[c(FALSE, TRUE, FALSE)],
hr_max = unlist(val)[c(FALSE, FALSE, TRUE)]) -> humidity.df
full_join(temperature.df, humidity.df)
Most YoctoPuce input modules have a built-in data logger, and the stored
data can also be downloaded as a CSV file through a physical or virtual hub.
As shown above, it is possible to control the modules through the HTTP server
built into the physical and virtual hubs. Alternatively, the R package ‘reticulate’
can be used to control YoctoPuce modules by means of the Python library that
gives access to their API.
8.14 Databases
One of the advantages of using databases is that subsets of cases and variables
can be retrieved, even remotely, making it possible to work in R with huge data
sets, both locally and remotely. One should remember that R natively keeps whole
objects in RAM and, consequently, the available machine memory limits the size of
the data sets with which it is possible to work. Package ‘dbplyr’ provides the tools
for working with data in databases using the same verbs as when using ‘dplyr’ with
data stored in memory (RAM) (see chapter 6). This is an important subject, but
extensive enough to be outside the scope of this book. We provide a few simple
examples to show the very basics; interested readers should consult R for Data
Science (Wickham and Grolemund 2017).
The additional steps compared to using ‘dplyr’ start with the need to establish a
connection to a local or remote database. We will use R package ‘RSQLite’ to create
a local temporary SQLite database. ‘dbplyr’ backends supporting other database
systems are also available. We will use meteorological data from ‘learnrbook’ for
this example.
library(dplyr)
library(learnrbook) # makes weather_wk_25_2019.tb available
con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")
copy_to(con, weather_wk_25_2019.tb, "weather",
temporary = FALSE,
indexes = list(
c("month_name", "calendar_year", "solar_time"),
"time",
"sun_elevation",
"was_sunny",
"day_of_year",
"month_of_year"
)
)
weather.db <- tbl(con, "weather")
colnames(weather.db)
## [1] "time" "PAR_umol" "PAR_diff_fr" "global_watt"
## [5] "day_of_year" "month_of_year" "month_name" "calendar_year"
## [9] "solar_time" "sun_elevation" "sun_azimuth" "was_sunny"
## [13] "wind_speed" "wind_direction" "air_temp_C" "air_RH"
## [17] "air_DP" "air_pressure" "red_umol" "far_red_umol"
## [21] "red_far_red"
weather.db %.>%
filter(., sun_elevation > 5) %.>%
group_by(., day_of_year) %.>%
summarise(., energy_Wh = sum(global_watt, na.rm = TRUE) * 60 / 3600)
## # Source: lazy query [?? x 2]
## # Database: sqlite 3.30.1 [:memory:]
## day_of_year energy_Wh
## <dbl> <dbl>
## 1 162 7500.
## 2 163 6660.
## 3 164 3958.
## # ... with more rows
Package ‘dbplyr’ translates data pipes that use ‘dplyr’ syntax into SQL
queries to databases, either local or remote. As long as there are no problems
with the backend, the use of a database is almost transparent to the R user.
= It is always good to clean up; in the case of this book, it is also the best way
to test that the examples can be run in a “clean” system.
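A minimal sketch of the elided cleanup step, closing the connection opened above:
DBI::dbDisconnect(con)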
Bibliography
Faraway, J. J. (2006). Extending the linear model with R: generalized linear, mixed
effects and nonparametric regression models. Chapman & Hall/CRC Taylor &
Francis Group, p. 345. isbn: 158488424X (cit. on p. 161).
Gandrud, C. (2015). Reproducible Research with R and R Studio. 2nd ed. The R
Series. Chapman and Hall/CRC. 323 pp. isbn: 1498715370
(cit. on pp. 9, 10).
Hamming, R. W. (Mar. 1, 1987). Numerical Methods for Scientists and Engineers.
Dover Publications Inc. 752 pp. isbn: 0486652416.
Hillebrand, J. and M. H. Nierhoff (2015). Mastering RStudio: Develop, Communicate,
and Collaborate with R. Packt Publishing. 348 pp. isbn: 9781783982554 (cit. on
p. 8).
Holmes, S. and W. Huber (Mar. 1, 2019). Modern Statistics for Modern Biology. Cam-
bridge University Press. 382 pp. isbn: 1108705294 (cit. on p. 161).
Hughes, T. P. (2004). American Genesis. The University of Chicago Press. 530 pp.
isbn: 0226359271 (cit. on p. 92).
Johnson, K. A. and R. S. Goody (2011). “The Original Michaelis Constant: Transla-
tion of the 1913 Michaelis–Menten Paper”. In: Biochemistry 50, pp. 8264–8269.
doi: 10.1021/bi201284u (cit. on p. 141).
Kernighan, B. W. and P. J. Plauger (1981). Software Tools in Pascal. Reading, Mas-
sachusetts: Addison-Wesley Publishing Company. 366 pp. (cit. on pp. 180, 187).
Kernighan, B. W. and R. Pike (1999). The Practice of Programming. Addison Wesley.
288 pp. isbn: 020161586X (cit. on p. 15).
Knuth, D. E. (1984). “Literate programming”. In: The Computer Journal 27.2, pp. 97–
111 (cit. on pp. 9, 91).
Lamport, L. (1994). LaTeX: a document preparation system. English. 2nd ed. Reading:
Addison-Wesley, p. 272. isbn: 0-201-52983-1 (cit. on p. 91).
Leisch, F. (2002). “Dynamic generation of statistical reports using literate data
analysis”. In: Proceedings in Computational Statistics. Compstat 2002. Ed. by W.
Härdle and B. Rönz. Heidelberg, Germany: Physika Verlag, pp. 575–580. isbn:
3-7908-1517-9 (cit. on p. 9).
Lemon, J. (2020). Kickstarting R. url: https://cran.r-project.org/doc/
contrib/Lemon-kickstart/kr_intro.html (visited on 02/07/2020).
Loo, M. P. van der and E. de Jonge (2012). Learning RStudio for R Statistical Comput-
ing. 1st ed. Birmingham: Packt Publishing, p. 126. isbn: 9781782160601 (cit. on
p. 8).
Matloff, N. (2011). The Art of R Programming: A Tour of Statistical Software Design.
No Starch Press, p. 400. isbn: 1593273843 (cit. on pp. 86, 117, 179).
Murrell, P. (2011). R Graphics. 2nd ed. CRC Press, p. 546. isbn: 1439831769 (cit. on
p. 203).
— (2019). R Graphics. 3rd ed. Portland: CRC Press/Taylor & Francis. 423 pp. isbn:
1498789056 (cit. on pp. 83, 291).
Newham, C. and B. Rosenblatt (June 1, 2005). Learning the bash Shell. O’Reilly UK
Ltd. 352 pp. isbn: 0596009658 (cit. on p. 15).
Peng, R. D. (2016). R Programming for Data Science. Leanpub. 182 pp. url:
https://leanpub.com/rprogramming (visited on 07/31/2019) (cit. on pp. 86, 181).
Pinheiro, J. C. and D. M. Bates (2000). Mixed-Effects Models in S and S-Plus. New
York: Springer (cit. on pp. 161, 164).
Sarkar, D. (2008). Lattice: Multivariate Data Visualization with R. 1st ed. Springer,
p. 268. isbn: 0387759689 (cit. on pp. 83, 164, 203).
Smith, H. F. (1957). “Interpretation of adjusted treatment means and regressions
in analysis of covariance”. In: Biometrics 13, pp. 281–308 (cit. on p. 138).
Tufte, E. R. (1983). The Visual Display of Quantitative Information. Cheshire, CT:
Graphics Press. 197 pp. isbn: 0-9613921-0-X (cit. on p. 241).
Venables, W. N. and B. D. Ripley (2002). Modern Applied Statistics with S. 4th ed.
New York: Springer. isbn: 0-387-95457-0 (cit. on p. 161).
Wickham, H. (2015). R Packages. O’Reilly Media. isbn: 9781491910542 (cit. on
pp. 164, 177).
— (2019). Advanced R. 2nd ed. Taylor & Francis Inc. 588 pp. isbn:
0815384572 (cit. on pp. 117, 177).
Wickham, H. and G. Grolemund (2017). R for Data Science. O’Reilly UK Ltd. isbn:
1491910399 (cit. on pp. 181, 201, 325).
Wickham, H. and C. Sievert (2016). ggplot2: Elegant Graphics for Data Analysis.
2nd ed. Springer. XVI + 260. isbn: 978-3-319-24277-4 (cit. on pp. 164, 203, 278,
291).
Wood, S. N. (2017). Generalized Additive Models. Taylor & Francis Inc. 476 pp. isbn:
1498728332 (cit. on p. 161).
Xie, Y. (2013). Dynamic Documents with R and knitr. The R Series. Chapman and
Hall/CRC, p. 216. isbn: 1482203537 (cit. on pp. 9, 10, 91).
— (2016). bookdown: Authoring Books and Technical Documents with R Markdown.
Chapman and Hall/CRC. isbn: 9781138700109 (cit. on p. 91).
Xie, Y., J. J. Allaire, and G. Grolemund (2018). R Markdown. Chapman and Hall/CRC.
304 pp. isbn: 1138359335 (cit. on p. 91).
Zachry, M. and C. Thralls (Oct. 2004). “An Interview with Edward R. Tufte”.
In: Technical Communication Quarterly 13.4, pp. 447–462. doi:
10.1207/s15427625tcq1304_5.
Zuur, A. F., E. N. Ieno, and E. Meesters (2009). A Beginner’s Guide to R. 1st ed.
Springer, p. 236. isbn: 0387938362 (cit. on p. 161).