Data Manipulation Workshop Handout
Ariel Muldoon
March 2020
Contents
Introduction and background 3
Where to find help . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Getting started . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Check package version . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Load packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
The mtcars dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Using mutate() with grouped datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Sorting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Combining data manipulation tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Using temporary objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Nesting functions to avoid temporary objects . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
The pipe operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Combining data manipulation tasks using the pipe operator . . . . . . . . . . . . . . . 25
Using the pipe operator with non-dplyr functions . . . . . . . . . . . . . . . . . . . . 26
Counting the number of rows in a group . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Practice data manipulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
The babynames dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Practice problem 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Practice problem 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Working through the practice problems 43
Answers problem 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Answer problem 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Answers problem 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Introduction and background
Today we are going to be learning how to perform basic data manipulation tasks in R. While there are many
options for tackling data manipulation problems in R (e.g., apply family, data.table package, functions
ave() and aggregate()), we will be working with the dplyr and tidyr packages today. I find that these
packages are approachable for people without a lot of programming background but are still quite fast when
working with large datasets.
In this workshop, we will cover the following:
In Part 1, we’ll review functions from dplyr for basic data manipulation/munging/cleaning. We
end with a chance for you to practice some of the functions we covered.
In Part 2, you’ll be introduced to the concept of reshaping datasets via tidyr functions. We’ll
do another practice exercise at the end of this section.
In Part 3 we’ll practice joining datasets using the join functions from dplyr.
Where to find help
It is important to know where to go for help when you run into data manipulation problems. The first
place to start is the help pages for the functions themselves; too often folks skip this step and end up in a
time-consuming search that could have been avoided. Another place that I often go to find help is on the
Stack Overflow website: http://stackoverflow.com/questions/tagged/r. I’ve given you the link to questions
that are specifically R programming questions. You could also look for questions tagged with dplyr or tidyr
or search all R-related questions using keywords or phrases.
The newer RStudio Community website, https://community.rstudio.com/, is another place to look for and
ask for help that can be less intimidating than Stack Overflow.
Both of these packages are fairly young, and while they are stabilizing, some elements of the packages may
still change. Functions we are using today, however, are functions that are already stable and likely won’t
change much through time.
Both packages have introductory vignettes that are useful.
The Introduction to dplyr vignette is updated as dplyr is updated, and is a nice resource: https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html.
Also see the Tidy data vignette for some examples using tidyr: https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html.
The RStudio cheat sheets may also be helpful: https://www.rstudio.com/resources/cheatsheets/
Getting started
Check package version
The current version of dplyr is 0.8.3 and the current version of tidyr is 1.0.2.
You can use packageVersion() to check for the currently installed version of a package. Make sure you are using current versions of both packages.
packageVersion("dplyr")
[1] '0.8.3'
packageVersion("tidyr")
[1] '1.0.2'
If one of these packages isn’t up to date, you need to re-install it. You can install a package with code, e.g., install.packages("tidyr"), or via the RStudio Packages pane Install button. Remember that you do not need to install a package every time you use it, so don’t make this code part of a script.
In between version releases, bugs are fixed and new issues addressed in the development version of a pack-
age. For these two packages, you can see the changes, check for known issues, and download the current
development version via their Github repositories. For dplyr see https://github.com/tidyverse/dplyr and
for tidyr see https://github.com/tidyverse/tidyr.
Load packages
If all packages are up-to-date we can load dplyr and tidyr and get started.
library(dplyr)
library(tidyr)
The mtcars dataset
In the first part of the workshop we will be using the mtcars dataset to practice data manipulation. This
dataset comes with R, and information about this dataset is available in the R help files for the dataset
(?mtcars).
We will be using both categorical and continuous variables from this dataset, including:
mpg (Miles per US gallon),
wt (car weight in 1000 lbs),
cyl (number of cylinders),
am (type of transmission),
disp (engine displacement),
qsec (quarter mile time), and
hp (horsepower).
Let’s take a quick look at the first six lines (with head()) and structure (with str()) of this dataset. You
should recognize that cyl and am (as well as others like vs) are categorical variables. However, they are
considered numeric variables in the dataset since the categories are expressed with numbers.
head(mtcars)
str(mtcars)
We’re going to start out today by learning how to calculate summary statistics by group. I start here because this is a common task that I see folks struggle with in R. The task of calculating summaries by groups in R is
referred to as a split-apply-combine task because we want to split the dataset into groups, apply a function
to each split, and then combine the results back into a single dataset.
There are a variety of ways to perform such tasks in R. We will be using dplyr functions in this workshop
but in the long run you may find you like the style of another method better.
With dplyr, the key to split-apply-combine tasks is grouping. We need to define which variable contains the
groups that we want to summarize separately. We create a grouped dataset using the group_by() function.
Let’s create a grouped dataset named bycyl, where we group mtcars by the cyl variable. The cyl variable
is a categorical variable representing the number of cylinders a car has. This variable has 3 different levels,
4, 6, and 8.
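Making this grouped dataset takes a single group_by() call, something like:
bycyl = group_by(mtcars, cyl)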
We can see that the new object is a grouped dataset if we print the head of the dataset and see the Groups
tag or see the class grouped_df in the object structure.
head(bycyl)
# A tibble: 6 x 11
# Groups: cyl [3]
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
str(bycyl)
Using the summarise() function
Now that we have a grouped dataset, we can use it with the summarise() function to calculate summary
statistics by group. Note that summarize() is an alternative spelling for the same function.
We’ll start by calculating the mean engine displacement for each cylinder category. We will be working on
the grouped dataset bycyl since we want summaries by groups.
Notice that the first argument of summarise() is the dataset we want summarized. This is true for most
of the dplyr functions. We list the summary function and variable we want summarized as the second
argument.
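The call that produces the summary below would look something like:
summarise( bycyl, mean(disp) )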
# A tibble: 3 x 2
cyl `mean(disp)`
<dbl> <dbl>
1 4 105.
2 6 183.
3 8 353.
Notice that we printed the summarized dataset but did not name the resulting object. This is what we
will be doing for most of the workshop, as my goal is to show you what happens to the dataset after we
manipulate it. You certainly can (and likely will want to) name your final datasets. We’ll see some examples
of naming the new objects once we are doing multiple data manipulation tasks at one time.
We can summarize multiple variables or use different summary functions at once in summarise() by using
commas to separate each new function/variable.
For example, we can calculate the mean of engine displacement and horsepower by cylinder category in the
same function call.
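One way to write this, with the two summaries separated by a comma:
summarise( bycyl, mean(disp), mean(hp) )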
# A tibble: 3 x 3
cyl `mean(disp)` `mean(hp)`
<dbl> <dbl> <dbl>
1 4 105. 82.6
2 6 183. 122.
3 8 353. 209.
The default names for the new variables we’ve been calculating are sufficient for a quick summary but are
not particularly convenient if we wanted to use the result for anything further in R. We can set variable
names as we summarize.
Let’s calculate the mean and standard deviation of engine displacement by cylinder category and name the
new variables mdisp and sdisp, respectively.
summarise( bycyl, mdisp = mean(disp), sdisp = sd(disp) )
# A tibble: 3 x 3
cyl mdisp sdisp
<dbl> <dbl> <dbl>
1 4 105. 26.9
2 6 183. 41.6
3 8 353. 67.8
Datasets can be grouped by multiple variables as well as by a single variable. This is common for studies with multiple factors of interest or with nested study designs (e.g., plots nested in transects nested in sites).
Let’s group mtcars by both cyl and am (transmission type) and then calculate the mean engine displacement.
In the output you can see we calculated mean engine displacement for every factor combination, for a total
of six rows (3 cyl categories and 2 am categories).
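A sketch of this two-variable grouping (the object name bycylam is only for illustration):
bycylam = group_by(mtcars, cyl, am)
summarise( bycylam, mdisp = mean(disp) )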
# A tibble: 6 x 3
# Groups: cyl [3]
cyl am mdisp
<dbl> <dbl> <dbl>
1 4 0 136.
2 4 1 93.6
3 6 0 205.
4 6 1 155
5 8 0 358.
6 8 1 326
Ungrouping a dataset
Looking at our last result, we can see the dataset is still grouped by the cyl variable (i.e., cyl is listed in
“Groups”). If we are finished with our data manipulation it is best practice to ungroup the dataset. Trying
to work with a dataset that is grouped when we don’t want it to be can lead to unusual behavior. It is
“safest” to make sure the final version of a dataset is ungrouped.
Ungrouping is done via the ungroup() function. Notice we no longer have any Groups listed in the output
once we do this, as the result is no longer grouped by any variables.
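A sketch of ungrouping that summary (again, the object names are only for illustration):
sum_cylam = summarise( bycylam, mdisp = mean(disp) )
ungroup(sum_cylam)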
# A tibble: 6 x 3
cyl am mdisp
<dbl> <dbl> <dbl>
1 4 0 136.
2 4 1 93.6
3 6 0 205.
4 6 1 155
5 8 0 358.
6 8 1 326
Summarizing multiple variables at once
When we want to summarize many variables in a dataset using the same function, we can use one of
the scoped variants of summarise(). The scoped variants are summarise_all(), summarise_at(), and
summarise_if().
Note: These scoped functions will still be available but will be superseded by across() in dplyr
1.0.0, which will be released in 2020.
summarise_all()
The summarise_all() function is useful when we want to summarize every non-grouping variable in the
dataset with the same function. We give the function we want to use for the summaries as the second
argument, .funs.
Let’s see how summarise_all() works by calculating the mean of every variable in mtcars for each cylinder
category.
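The call would look something like:
summarise_all( bycyl, .funs = mean )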
# A tibble: 3 x 11
cyl mpg disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 4 26.7 105. 82.6 4.07 2.29 19.1 0.909 0.727 4.09 1.55
2 6 19.7 183. 122. 3.59 3.12 18.0 0.571 0.429 3.86 3.43
3 8 15.1 353. 209. 3.23 4.00 16.8 0 0.143 3.29 3.5
Note that we need to be careful with summarise_all(). We could have problems if trying to summarize both
continuous and categorical variables in a single dataset and could end up with an error. All the variables
in mtcars are currently numeric. What would happen if we made one of the variables a factor and tried to
take the mean of every variable?
mtcars$vs = factor(mtcars$vs)
R still does the averaging, but returns NA and warning messages for the vs column.
# A tibble: 3 x 11
cyl mpg disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 4 26.7 105. 82.6 4.07 2.29 19.1 0.909 0.727 4.09 1.55
2 6 19.7 183. 122. 3.59 3.12 18.0 0.571 0.429 3.86 3.43
3 8 15.1 353. 209. 3.23 4.00 16.8 0 0.143 3.29 3.5
summarise_at()
We won’t always want to summarize every column in a dataset, for reasons including having a mix of variable
types. One option to only summarize some of the variables is to use summarise_at(), where we can list a
subset of the columns that we want summaries for by name in the .vars argument.
You can list the variables to summarize within vars().
summarise_at(bycyl, .vars = vars(disp, wt), .funs = mean)
# A tibble: 3 x 3
cyl disp wt
<dbl> <dbl> <dbl>
1 4 105. 2.29
2 6 183. 3.12
3 8 353. 4.00
We can also drop out the variables we don’t want summarized rather than writing out the ones we do want.
For example, while all the variables in mtcars are read as numeric, some are actually categorical. If we don’t
want to treat them as continuous, we can drop them from the summary. Let’s drop am and vs from our
summary. We can do this by using the minus sign with the variable names inside vars().
We will talk more about selecting and dropping specific variables later today when we talk about the
select() function.
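One way to write this drop-based summary:
summarise_at( bycyl, .vars = vars(-am, -vs), .funs = mean )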
# A tibble: 3 x 9
cyl mpg disp hp drat wt qsec gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 4 26.7 105. 82.6 4.07 2.29 19.1 4.09 1.55
2 6 19.7 183. 122. 3.59 3.12 18.0 3.86 3.43
3 8 15.1 353. 209. 3.23 4.00 16.8 3.29 3.5
summarise_if()
If we want to choose the columns we want to summarize using a logical predicate function, we can use
summarise_if(). You can see on the help page that the predicate function is the second argument,
.predicate, followed by the summary functions.
Here, we’ll only summarize the numeric variables by using the predicate function is.numeric(). Using this,
R checks if a column is numeric with is.numeric() and if the result is TRUE a summary of the column is
made. If the result is FALSE, the variable is dropped from the output.
In this example, all variables except vs are numeric and will be summarized.
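The call would look something like:
summarise_if( bycyl, .predicate = is.numeric, .funs = mean )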
# A tibble: 3 x 11
cyl mpg disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 4 26.7 105. 82.6 4.07 2.29 19.1 0.909 0.727 4.09 1.55
2 6 19.7 183. 122. 3.59 3.12 18.0 0.571 0.429 3.86 3.43
3 8 15.1 353. 209. 3.23 4.00 16.8 0 0.143 3.29 3.5
If we want to summarize many variables with multiple functions, we pass all the functions we want to the
.funs argument in a list(). The functions are listed with commas between them.
For example, maybe we want to calculate both the mean and the maximum for all numeric variables by
group. The functions we use are mean() and max().
While the only example we see today is using summarise_if(), this can be done in any of the summarise_*
functions.
summarise_if( bycyl,
.predicate = is.numeric,
.funs = list(mean, max) )
# A tibble: 3 x 21
cyl mpg_fn1 disp_fn1 hp_fn1 drat_fn1 wt_fn1 qsec_fn1 vs_fn1 am_fn1 gear_fn1 carb_fn1 mpg_fn2
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 4 26.7 105. 82.6 4.07 2.29 19.1 0.909 0.727 4.09 1.55 33.9
2 6 19.7 183. 122. 3.59 3.12 18.0 0.571 0.429 3.86 3.43 21.4
3 8 15.1 353. 209. 3.23 4.00 16.8 0 0.143 3.29 3.5 19.2
# ... with 9 more variables: disp_fn2 <dbl>, hp_fn2 <dbl>, drat_fn2 <dbl>, wt_fn2 <dbl>,
# qsec_fn2 <dbl>, vs_fn2 <dbl>, am_fn2 <dbl>, gear_fn2 <dbl>, carb_fn2 <dbl>
Notice that we get fn1 and fn2 appended to the variable name when using multiple functions. To control
what name is appended you can assign names to each function within the list().
summarise_if( bycyl,
.predicate = is.numeric,
.funs = list(mn = mean, mx = max) )
# A tibble: 3 x 21
cyl mpg_mn disp_mn hp_mn drat_mn wt_mn qsec_mn vs_mn am_mn gear_mn carb_mn mpg_mx disp_mx hp_mx
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 4 26.7 105. 82.6 4.07 2.29 19.1 0.909 0.727 4.09 1.55 33.9 147. 113
2 6 19.7 183. 122. 3.59 3.12 18.0 0.571 0.429 3.86 3.43 21.4 258 175
3 8 15.1 353. 209. 3.23 4.00 16.8 0 0.143 3.29 3.5 19.2 472 335
# ... with 7 more variables: drat_mx <dbl>, wt_mx <dbl>, qsec_mx <dbl>, vs_mx <dbl>, am_mx <dbl>,
# gear_mx <dbl>, carb_mx <dbl>
The dplyr package truncates how much of the dataset we see printed into the R Console. For very wide
datasets like the one we just created, we can get a better idea of what the result looks like using glimpse().
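For example, something like this (the object name sum_mnmx is only for illustration):
sum_mnmx = summarise_if( bycyl,
                         .predicate = is.numeric,
                         .funs = list(mn = mean, mx = max) )
glimpse(sum_mnmx)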
Observations: 3
Variables: 21
$ cyl <dbl> 4, 6, 8
$ mpg_mn <dbl> 26.66364, 19.74286, 15.10000
$ disp_mn <dbl> 105.1364, 183.3143, 353.1000
$ hp_mn <dbl> 82.63636, 122.28571, 209.21429
$ drat_mn <dbl> 4.070909, 3.585714, 3.229286
$ wt_mn <dbl> 2.285727, 3.117143, 3.999214
$ qsec_mn <dbl> 19.13727, 17.97714, 16.77214
$ vs_mn <dbl> 0.9090909, 0.5714286, 0.0000000
$ am_mn <dbl> 0.7272727, 0.4285714, 0.1428571
$ gear_mn <dbl> 4.090909, 3.857143, 3.285714
$ carb_mn <dbl> 1.545455, 3.428571, 3.500000
$ mpg_mx <dbl> 33.9, 21.4, 19.2
$ disp_mx <dbl> 146.7, 258.0, 472.0
$ hp_mx <dbl> 113, 175, 335
$ drat_mx <dbl> 4.93, 3.92, 4.22
$ wt_mx <dbl> 3.190, 3.460, 5.424
$ qsec_mx <dbl> 22.90, 20.22, 18.00
$ vs_mx <dbl> 1, 1, 0
$ am_mx <dbl> 1, 1, 1
$ gear_mx <dbl> 5, 5, 5
$ carb_mx <dbl> 2, 6, 8
Now we will cover functions for other common data manipulation tasks, starting with filtering. Filtering is
about how many rows we want in the dataset, not about the number of columns. It involves making specific
subsets of your data by removing unwanted rows. Rows to keep are chosen based on logical conditions.
For example, maybe we want to focus on a subset of the dataset that only involves cars with automatic
transmissions. We can do this with the filter() function to filter the mtcars dataset to only those rows
where am is 0.
Like other dplyr functions, the dataset is the first argument in filter(). The subsequent arguments are
the conditions that the filtered dataset should meet. Here, the condition is that cars must have automatic
transmissions, or am == 0 (note the two equals signs).
filter(mtcars, am == 0)
The filter() function will always be used with logical operators such as == (testing for equality), != (testing
for inequality), < (less than), is.na (all NA values), !is.na (all values except NA), >= (greater than or equal
to), etc.
If we wanted to filter out all cars that weigh more than 4000 lbs (i.e., wt values above 4, since wt is in units of 1000 lbs), we can keep only the rows where wt <= 4.
filter(mtcars, wt <= 4)
# A tibble: 28 x 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# ... with 18 more rows
Alternatively, we could achieve the same thing by choosing everything that is not greater than 4, !wt > 4.
The exclamation point, !, is the not operator.
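That version of the call looks like:
filter(mtcars, !wt > 4)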
# A tibble: 28 x 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# ... with 18 more rows
We can filter grouped datasets, and the condition will be applied separately to each group. For example,
maybe we want to keep only the rows where wt is greater than its cylinder category group mean.
Notice I switch to filtering the grouped dataset bycyl here.
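One way to write this grouped filter:
filter( bycyl, wt > mean(wt) )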
# A tibble: 13 x 11
# Groups: cyl [3]
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
2 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
3 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
4 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
5 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
6 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
7 17.8 6 168. 123 3.92 3.44 18.9 1 0 4 4
8 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 3
9 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4
10 10.4 8 460 215 3 5.42 17.8 0 0 3 4
11 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4
12 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1
13 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
And, of course, we can filter datasets by multiple conditions at once. If we wanted to filter the dataset to
only cars with automatic transmission (am == 0) and that have weights less than or equal to 4000 lbs (wt
<= 4), we can include both conditions in filter() separated by a comma.
filter(mtcars, am == 0, wt <= 4)
While we won’t see it today, if you need a logical OR statement you will need the | symbol, found on the
backslash key on your keyboard.
The dplyr package has filter_all(), filter_at(), and filter_if() verbs available. These would be
useful if we wanted to apply the same filter to many columns of data.
These are often used in combination with the functions any_vars() or all_vars(). The “Examples” section
of the help page is a good place to start to see worked examples.
Selecting variables with select()
Keeping only a subset of the columns of a dataset is referred to as selecting variables. This might be for
organizational reasons, where an analysis is focused on only some of many variables and so we want to create
a dataset that contains only the variables of interest. Selecting is about how many columns we want to keep,
not how many rows we have.
The dplyr function select() makes selecting columns very easy to do. We can keep or drop variables by
name (although you can also use the index number) with straightforward code.
Let’s select only the cyl variable from mtcars (printing just the first rows to save space in this document).
select(mtcars, cyl)
# A tibble: 32 x 1
cyl
<dbl>
1 6
2 6
3 4
4 6
5 8
6 6
7 8
8 4
9 4
10 6
# ... with 22 more rows
If we want to keep all variables between (and including) cyl and vs, we indicate that with the colon, :.
select(mtcars, cyl:vs)
# A tibble: 32 x 7
cyl disp hp drat wt qsec vs
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
1 6 160 110 3.9 2.62 16.5 0
2 6 160 110 3.9 2.88 17.0 0
3 4 108 93 3.85 2.32 18.6 1
4 6 258 110 3.08 3.22 19.4 1
5 8 360 175 3.15 3.44 17.0 0
6 6 225 105 2.76 3.46 20.2 1
7 8 360 245 3.21 3.57 15.8 0
8 4 147. 62 3.69 3.19 20 1
9 4 141. 95 3.92 3.15 22.9 1
10 6 168. 123 3.92 3.44 18.3 1
# ... with 22 more rows
If we want to keep only a few columns, we can separate the desired column names with a comma. Here we
select only cyl and vs.
select(mtcars, cyl, vs)
# A tibble: 32 x 2
cyl vs
<dbl> <fct>
1 6 0
2 6 0
3 4 1
4 6 1
5 8 0
6 6 1
7 8 0
8 4 1
9 4 1
10 6 1
# ... with 22 more rows
The select() function has several special functions to make variable selection even easier. See the help
page for select_helpers for a list of all of these (?select_helpers).
These special functions include starts_with(), contains(), and ends_with(), among others. Such functions can be very useful if you have coded your variable names so that related variables contain the same letters or numbers.
We are going to start with an example using starts_with(), where we select all variables with names that start with a lowercase d. Remember that R is case sensitive, so an uppercase D is different than a lowercase d.
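One way to write this:
select( mtcars, starts_with("d") )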
# A tibble: 32 x 2
disp drat
<dbl> <dbl>
1 160 3.9
2 160 3.9
3 108 3.85
4 258 3.08
5 360 3.15
6 225 2.76
7 360 3.21
8 147. 3.69
9 141. 3.92
10 168. 3.92
# ... with 22 more rows
Or we could keep all variables that contain a lowercase a anywhere in the variable name.
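For example:
select( mtcars, contains("a") )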
# A tibble: 32 x 4
drat am gear carb
<dbl> <dbl> <dbl> <dbl>
1 3.9 1 4 4
2 3.9 1 4 4
3 3.85 1 4 1
4 3.08 0 3 1
5 3.15 0 3 2
6 2.76 0 3 1
7 3.21 0 3 4
8 3.69 0 4 2
9 3.92 0 4 2
10 3.92 0 4 4
# ... with 22 more rows
We’ve been choosing which variables we want to keep, but we could also choose which variables we want to
drop like we did with summarise_at() earlier. We drop variables using the minus sign (-).
Drop the gear variable.
select(mtcars, -gear)
# A tibble: 32 x 10
mpg cyl disp hp drat wt qsec vs am carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 1
7 14.3 8 360 245 3.21 3.57 15.8 0 0 4
8 24.4 4 147. 62 3.69 3.19 20 1 0 2
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 2
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4
# ... with 22 more rows
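We can also drop more than one variable by giving each its own minus sign. Something like this reproduces the next result, which has both gear and carb removed:
select(mtcars, -gear, -carb)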
# A tibble: 32 x 9
mpg cyl disp hp drat wt qsec vs am
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1
2 21 6 160 110 3.9 2.88 17.0 0 1
3 22.8 4 108 93 3.85 2.32 18.6 1 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0
5 18.7 8 360 175 3.15 3.44 17.0 0 0
6 18.1 6 225 105 2.76 3.46 20.2 1 0
7 14.3 8 360 245 3.21 3.57 15.8 0 0
8 24.4 4 147. 62 3.69 3.19 20 1 0
9 22.8 4 141. 95 3.92 3.15 22.9 1 0
10 19.2 6 168. 123 3.92 3.44 18.3 1 0
# ... with 22 more rows
Drop all variables between and including am and carb. Notice that parentheses are needed around the
variables to use - like this.
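One way to write this:
select( mtcars, -(am:carb) )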
# A tibble: 32 x 8
mpg cyl disp hp drat wt qsec vs
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
1 21 6 160 110 3.9 2.62 16.5 0
2 21 6 160 110 3.9 2.88 17.0 0
3 22.8 4 108 93 3.85 2.32 18.6 1
4 21.4 6 258 110 3.08 3.22 19.4 1
5 18.7 8 360 175 3.15 3.44 17.0 0
6 18.1 6 225 105 2.76 3.46 20.2 1
7 14.3 8 360 245 3.21 3.57 15.8 0
8 24.4 4 147. 62 3.69 3.19 20 1
9 22.8 4 141. 95 3.92 3.15 22.9 1
10 19.2 6 168. 123 3.92 3.44 18.3 1
# ... with 22 more rows
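The code for the next result isn’t shown, but dropping the adjacent drat and wt columns reproduces it; for example:
select( mtcars, -(drat:wt) )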
# A tibble: 32 x 9
mpg cyl disp hp qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl>
1 21 6 160 110 16.5 0 1 4 4
2 21 6 160 110 17.0 0 1 4 4
3 22.8 4 108 93 18.6 1 1 4 1
4 21.4 6 258 110 19.4 1 0 3 1
5 18.7 8 360 175 17.0 0 0 3 2
6 18.1 6 225 105 20.2 1 0 3 1
7 14.3 8 360 245 15.8 0 0 3 4
8 24.4 4 147. 62 20 1 0 4 2
9 22.8 4 141. 95 22.9 1 0 4 2
10 19.2 6 168. 123 18.3 1 0 4 4
# ... with 22 more rows
The select_helpers can be used in other functions, as well. We would commonly use them in the scoped
*_at() functions like summarise_at() to help pick the variables to use within the function. The select()
function also has scoped variants available, select_all(), select_at(), and select_if().
The mutate() function adds new variables to a dataset, built from existing columns. As with the other dplyr functions we’ve seen, the dataset is the first argument. For example, let’s add a variable named disp.hp that is the sum of disp and hp.
mutate(mtcars, disp.hp = disp + hp)
# A tibble: 32 x 12
mpg cyl disp hp drat wt qsec vs am gear carb disp.hp
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 270
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 270
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 201
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 368
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 535
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 330
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 605
8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 209.
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 236.
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 291.
# ... with 22 more rows
We can make multiple new variables at once, separating each new variable by a comma like we did in
summarise(). A handy feature of mutate() is that we can work directly with the new variables we’ve made
within the same function call. For example, we can first calculate disp.hp and then calculate a second
variable that is half of disp.hp (disp.hp divided by 2). We can create other variables, as well, so we’ll
create the ratio of qsec and wt while we’re at it.
mutate(mtcars,
disp.hp = disp + hp,
halfdh = disp.hp/2,
qw = qsec/wt)
# A tibble: 32 x 14
mpg cyl disp hp drat wt qsec vs am gear carb disp.hp halfdh qw
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 270 135 6.28
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 270 135 5.92
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 201 100. 8.02
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 368 184 6.05
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 535 268. 4.95
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 330 165 5.84
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 605 302. 4.44
8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 209. 104. 6.27
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 236. 118. 7.27
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 291. 145. 5.32
# ... with 22 more rows
Using mutate() with grouped datasets
We can work with grouped datasets when using mutate(). This is useful when we want to add a column of a summary statistic for each group to the existing dataset rather than making a summary dataset.
Let’s create and add a new variable that is the mean horsepower for each cylinder category. Each car within
a cylinder category will have the same value of mean horsepower, as mutate() always returns a new dataset
that is the same length as the original.
Since this is a grouped operation we’ll work with the grouped dataset bycyl we made earlier.
mutate( bycyl, mhp = mean(hp) )
# A tibble: 32 x 12
# Groups: cyl [3]
mpg cyl disp hp drat wt qsec vs am gear carb mhp
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 122.
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 122.
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 82.6
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 122.
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 209.
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 122.
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 209.
8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 82.6
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 82.6
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 122.
# ... with 22 more rows
As you can see, the code for mutate() resembles the code for summarise(). While we will not see examples
today, there are mutate_all()/mutate_at()/mutate_if() functions available that work much like the
scoped variants of the summarise() function we saw earlier today.
There is also a function called transmute(), which creates new variables that are the same length as the
current dataset like mutate() but only returns the new variables like summarise().
Sorting
There are some situations where you might want to sort your dataset by variables within the dataset. For
example, if we want to pull out the first observation in each group from a time series we might sort the
dataset first by time within group prior to filtering. We can sort datasets with dplyr using arrange().
Here we’ll start by sorting mtcars by cyl. By default we sort whatever variable we are sorting on from low
to high (ascending order).
arrange(mtcars, cyl)
# A tibble: 32 x 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl>
1 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
2 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
3 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
4 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1
5 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
6 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1
7 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1
8 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1
9 26 4 120. 91 4.43 2.14 16.7 0 1 5 2
10 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2
# ... with 22 more rows
To sort datasets by variables in descending order (highest to lowest), we can use the minus sign (-) or the
function desc() (which is from dplyr).
arrange(mtcars, -cyl)
# A tibble: 32 x 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl>
1 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
2 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
3 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 3
4 17.3 8 276. 180 3.07 3.73 17.6 0 0 3 3
5 15.2 8 276. 180 3.07 3.78 18 0 0 3 3
6 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4
7 10.4 8 460 215 3 5.42 17.8 0 0 3 4
8 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4
9 15.5 8 318 150 2.76 3.52 16.9 0 0 3 2
10 15.2 8 304 150 3.15 3.44 17.3 0 0 3 2
# ... with 22 more rows
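The desc() version of the same sort looks like:
arrange( mtcars, desc(cyl) )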
# A tibble: 32 x 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl>
1 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
2 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
3 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 3
4 17.3 8 276. 180 3.07 3.73 17.6 0 0 3 3
5 15.2 8 276. 180 3.07 3.78 18 0 0 3 3
6 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4
7 10.4 8 460 215 3 5.42 17.8 0 0 3 4
8 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4
9 15.5 8 318 150 2.76 3.52 16.9 0 0 3 2
10 15.2 8 304 150 3.15 3.44 17.3 0 0 3 2
# ... with 22 more rows
To sort variables only within groups, we sort by the grouping variable first and then the other sorting
variables. The arrange() function ignores group_by(); this is different than all the other dplyr verbs
we’ve learned today.
Here’s an example of within-group sorting, sorting each cylinder category from lowest to highest wt.
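One way to write this, sorting by cyl first and then wt within cyl:
arrange(mtcars, cyl, wt)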
10 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
11 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
12 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
13 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
14 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
15 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
16 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
17 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
18 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
19 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
20 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
21 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
22 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
23 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
24 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
25 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
26 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
27 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
28 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
29 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
30 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
31 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
32 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Combining data manipulation tasks
When working with our own datasets we’ll often want to do multiple data manipulation tasks in a row. Now that we’ve learned how to do different kinds of data manipulation, let’s string multiple manipulations together.
We are going to:
1. Filter the mtcars dataset to only those cars with automatic transmissions;
2. Create a new variable that is the ratio of engine displacement and horsepower;
3. Calculate the mean of this new variable separately for each cylinder category.
Using temporary objects
First we’ll do this one step at a time, creating a new named object for each step. As a reminder, we haven’t been naming objects as we practiced the functions above but instead were only printing results to the R Console. Now we’re actually naming each object. I use = for assignment but you can also use <-.
The extra pair of parentheses I’m using prints the object so we can see what happens at each step.
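A sketch of those four steps (the intermediate object names are only for illustration; the final object is the sum.ratio summary referred to below):
( autos = filter(mtcars, am == 0) )
( autos.ratio = mutate(autos, hd.ratio = hp/disp) )
( bycyl.auto = group_by(autos.ratio, cyl) )
( sum.ratio = summarise(bycyl.auto, mratio = mean(hd.ratio) ) )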
1 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
2 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
3 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
4 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
5 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
6 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
7 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
8 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
9 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
10 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
11 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
12 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
13 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
14 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
15 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
16 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
17 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
18 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
19 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
# A tibble: 3 x 2
cyl mratio
<dbl> <dbl>
1 4 0.635
2 6 0.590
3 8 0.554
The downside of this approach to multiple manipulations is that we had to make four objects when we really
just wanted the final sum.ratio object. We have to think of names for each object at each step and we end
up with a bunch of temporary objects in our R Environment.
Nesting functions to avoid temporary objects
An alternative to temporary objects is to nest all the functions together. This means we put one function call within the next function call. Nesting allows us to avoid making any temporary objects, but the resulting code is a bit hard to read. The code from nested functions is read inside out, where the first thing we do is also the most nested.
First, a simple example of nesting functions from work we did earlier, where we want to group the dataset by
cyl and am and then calculate the mean of disp. Here’s the same task via nesting. We put the group_by()
function call within summarise().
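That nested call looks like:
summarise( group_by(mtcars, cyl, am), mdisp = mean(disp) )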
# A tibble: 6 x 3
# Groups: cyl [3]
cyl am mdisp
<dbl> <dbl> <dbl>
1 4 0 136.
2 4 1 93.6
3 6 0 205.
4 6 1 155
5 8 0 358.
6 8 1 326
Now the more complicated example, where we combined the series of data manipulation tasks. Note how
the filter() is four functions deep in the code below.
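A sketch of the fully nested version:
summarise( group_by( mutate( filter(mtcars, am == 0),
                             hd.ratio = hp/disp ),
                     cyl ),
           mratio = mean(hd.ratio) )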
# A tibble: 3 x 2
cyl mratio
<dbl> <dbl>
1 4 0.635
2 6 0.590
3 8 0.554
The pipe operator
Now that we are combining multiple data manipulation functions from dplyr, it’s time to talk about the pipe operator. The pipe operator (%>%) represents a different coding style. The pipe allows us to perform a series
of data manipulation steps in a long chain while avoiding all those temporary objects or difficult-to-read
nested code.
In essence, the pipe operator pipes a dataset into a function as the first argument. One reason I’ve been
pointing out to you that the dplyr functions have the dataset as the first argument is that this is one of the
things that makes piping so easy with these functions.
You can think of the pipe as being pronounced “then”, which we’ll talk more about as we see some examples.
Using the pipe is a bit hard to picture when you are first introduced to it, but things should start to get
clearer once we see some code.
Let’s start with a simple example. Remember when we grouped mtcars by cyl earlier?
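That earlier code looked something like:
group_by(mtcars, cyl)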
We read even this simple code “inside out”. We see that we are grouping with group_by() and then if we
read inside the function we see the dataset we are going to group. Let’s write this same code using the pipe.
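With the pipe it becomes:
mtcars %>%
    group_by(cyl)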
The code with the pipe is read from left to right. We see we are working with the mtcars dataset and then
that we are grouping that dataset by cyl. The result is the same, but the code itself looks quite different.
Handily, we can keep piping through multiple functions in one long chain. Let’s group mtcars by cyl and
then calculate the mean disp of each group.
When working with pipes in a chain, it is standard to use a line break after each pipe with an indent for
each subsequent function.
Aside: Stylistically, including white space in your code improves code readability. Think of writing a sentence
without white space; it would be hard to read! Newer R users sometimes need to be reminded that white
space rationing is not in effect. :-D It might seem clunky at first, but including white space quickly becomes
natural and your code becomes much easier to read and understand.
mtcars %>%
group_by(cyl) %>%
summarise( mdisp = mean(disp) )
# A tibble: 3 x 2
cyl mdisp
<dbl> <dbl>
1 4 105.
2 6 183.
3 8 353.
Again, the above code is read from left to right. We see we are going to work with mtcars, then we group
it by cyl, and then we calculate the mean disp of the grouped dataset. When you read it like this you can
see why we might pronounce %>% as then.
Combining data manipulation tasks using the pipe operator
Now let’s return to the series of data manipulation tasks we combined earlier (filter, make a new variable, group, summarise) and write it as a single chain of pipes.
mtcars %>%
filter(am == 0) %>% # filter out the manual transmission cars
mutate(hd.ratio = hp/disp) %>% # make new ratio variable
group_by(cyl) %>% # group by number of cylinders
summarise(mratio = mean(hd.ratio) ) # calculate mean hd.ratio per cylinder category
# A tibble: 3 x 2
cyl mratio
<dbl> <dbl>
1 4 0.635
2 6 0.590
3 8 0.554
Using the pipe operator with non-dplyr functions
We can also pipe into functions from outside dplyr, as long as the dataset goes in as the first argument of that function. For example, we can pipe mtcars into head().
mtcars %>%
    head(n = 10)
If the first argument of a function is not the dataset, we need to use the dot, ., to represent the dataset
name in the function we are piping into. We can see this if we use the pipe operator with the t.test()
function, which doesn’t have data as the first argument.
Here we test for a difference in mean horsepower among transmission types based on the mtcars dataset.
The dataset is piped to the data argument with the ..
mtcars %>%
t.test(hp ~ am, data = .)
data: hp by am
t = 1.2662, df = 18.715, p-value = 0.221
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-21.87858 88.71259
sample estimates:
mean in group 0 mean in group 1
160.2632 126.8462
We generally wouldn’t use piping in such a simple case, though, as we would use the data argument in
t.test() directly. A more realistic example is if we wanted to filter the dataset before doing the test.
Let’s filter mtcars to cars weighing less than or equal to 4000 lbs and then test if mean horsepower is different
between transmission types.
mtcars %>%
filter(wt <= 4) %>%
t.test(hp ~ am, data = .)
data: hp by am
t = 0.76927, df = 19.747, p-value = 0.4508
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-35.68283 77.32385
sample estimates:
mean in group 0 mean in group 1
147.6667 126.8462
Counting the number of rows in a group
The n() function from dplyr counts the number of rows per group. It takes no arguments and can be used inside summarise() along with other summary functions, as below where we count the rows and calculate the mean disp for each cyl category.
mtcars %>%
group_by(cyl) %>%
summarise( n = n(),
mdisp = mean(disp) )
# A tibble: 3 x 3
cyl n mdisp
<dbl> <int> <dbl>
1 4 11 105.
2 6 7 183.
3 8 14 353.
Other useful functions that are related to n() are count() and tally() which can tally up number of rows
per group in fewer steps. Take a look at the help page for those to see how they work.
This function can be used directly inside other functions, such as filter(), for removing rows based on the group total count. I’ll keep only the rows of the dataset for cyl groups that have fewer than 10 observations. It turns out that this is true only for the 6-cylinder group.
To show best practice here I’ll ungroup() at the end of the pipe chain.
mtcars %>%
group_by(cyl) %>%
filter(n() < 10) %>%
ungroup()
# A tibble: 7 x 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
4 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
5 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
6 17.8 6 168. 123 3.92 3.44 18.9 1 0 4 4
7 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
The n() function can also be used when assigning index numbers within groups. I use this most often when
my rows within groups aren’t uniquely identified but I need them to be. This is especially useful if the group
sizes aren’t known or might vary.
In this example we’ll also select() just the first three columns so we can easily see the new index column
that we create. This column indexes from one to group size (n()) in each cylinder group.
mtcars %>%
group_by(cyl) %>%
select(1:3) %>%
mutate( index = 1:n() ) %>%
ungroup()
# A tibble: 32 x 4
mpg cyl disp index
<dbl> <dbl> <dbl> <int>
1 21 6 160 1
2 21 6 160 2
3 22.8 4 108 1
4 21.4 6 258 3
5 18.7 8 360 1
6 18.1 6 225 4
7 14.3 8 360 2
8 24.4 4 147. 2
9 22.8 4 141. 3
10 19.2 6 168. 5
# ... with 22 more rows
We might want to add this index in based on the order of some variable in the dataset, not on the order the
dataset is when we read it in. This is a case for arrange().
Let’s add the index based on the order of disp within each cyl category. We arrange() prior to creating
the index variable.
mtcars %>%
arrange(cyl, disp) %>%
group_by(cyl) %>%
select(1:3) %>%
mutate( index = 1:n() ) %>%
ungroup()
# A tibble: 32 x 4
mpg cyl disp index
<dbl> <dbl> <dbl> <int>
1 33.9 4 71.1 1
2 30.4 4 75.7 2
3 32.4 4 78.7 3
4 27.3 4 79 4
5 30.4 4 95.1 5
6 22.8 4 108 6
7 21.5 4 120. 7
8 26 4 120. 8
9 21.4 4 121 9
10 22.8 4 141. 10
# ... with 22 more rows
Practice data manipulation
So far we’ve covered a lot of material on data manipulation functions. Before going on to the next topic, I want to take some time to allow you to practice using some of the functions we’ve seen so far. I’ve set up two example problems below. Each example will take a different set of functions to solve.
The babynames dataset
We’ll be practicing using the babynames dataset. This can be found in package babynames. The current version of this package is 1.0.0. If you do not have this package or it is not up to date, please install it. You can do this with the RStudio Packages pane Install button, or run the code install.packages("babynames").
packageVersion("babynames")
[1] '1.0.0'
library(babynames)
The help page for babynames gives us some basic information on the dataset.
?babynames
The babynames dataset contains data from the United States Social Security Administration on the number
and proportion of babies given a name each year from 1880 through 2017. Rare names (recorded less than
5 times) are excluded from the dataset in R. The annual proportion of babies given a name was calculated
separately for male and female babies (sex).
The dataset has five variables, shown below.
glimpse(babynames)
Observations: 1,924,665
Variables: 5
$ year <dbl> 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880...
$ sex <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F...
$ name <chr> "Mary", "Anna", "Emma", "Elizabeth", "Minnie", "Margaret", "Ida", "Alice", "Bertha...
$ n <int> 7065, 2604, 2003, 1939, 1746, 1578, 1472, 1414, 1320, 1288, 1258, 1226, 1156, 1063...
$ prop <dbl> 0.07238359, 0.02667896, 0.02052149, 0.01986579, 0.01788843, 0.01616720, 0.01508119...
head(babynames)
# A tibble: 6 x 5
year sex name n prop
<dbl> <chr> <chr> <int> <dbl>
1 1880 F Mary 7065 0.0724
2 1880 F Anna 2604 0.0267
3 1880 F Emma 2003 0.0205
4 1880 F Elizabeth 1939 0.0199
5 1880 F Minnie 1746 0.0179
6 1880 F Margaret 1578 0.0162
Practice problem 1
Which name was given to the largest number of babies in the year you were born?
Practice problem 2
The second practice problem involves filtering, grouping, and then summarizing the number of rows per
group.
Calculate the total number of baby names for each level of the sex variable in the
year you were born and in 2017.
Hint: To use filter() with multiple values you’ll need %in% instead of ==. For example, if you wanted to
filter to the years 1980 and 2015 you’d use year %in% c(1980, 2015) for the condition in filter().
Here’s how I tackled this.
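One way to tackle this, using 1976 as the birth year (as in the later examples):
babynames %>%
    filter( year %in% c(1976, 2017) ) %>%
    group_by(year, sex) %>%
    summarise( n = n() )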
Part 2: Reshaping datasets
We are going to switch gears now and talk about how to reshape datasets.
In this section, we will learn to take the information from the columns of a dataset and put that information
on rows instead. This is an example of taking a wide dataset and making it long. We will also learn to
take information from the rows of a dataset and put that information into columns instead. In other words,
reshape a dataset from long to wide. None of this changes how much information we have, it just changes
how the information is stored.
We will be learning to reshape using the tidyr package.
The current language of the tidyr package involves pivoting. To pivot long means to take a wide format
dataset and transform it into a long dataset. To pivot wide means to take a long format dataset and make
it wide. We’ll see examples of these as we go along, which should help clear up any confusion with this new
terminology.
We’ll learn the basics of reshaping on what I call a toy dataset. A toy dataset is a set of fake data that we
make to practice functions on. Small toy datasets are handy when you are learning a new function or trying
to troubleshoot a data manipulation technique. We could use built-in datasets like mtcars, as well, but toy
datasets are conveniently very small.
The dataset that we will create, toy1, will have six rows and five columns.
The first column contains the levels of some treatment (trt).
The second contains the identifier of individuals the treatment was applied to (indiv). These identifiers are
repeated across treatments, so individual 1 in treatment a is different than individual 1 in treatment b. This
means the combination of treatment and individual is the unique identifier for each row.
The last three columns are some quantitative measurement taken at three different times (time1, time2,
and time3).
The shape of this toy dataset is one I commonly see for data from studies that take measurements through
time.
I’m not going to walk through this code, but below you can see how I create this dataset. If you are interested
in more information on how to get started simulating data in R, see my post here.
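A sketch of one way to build such a dataset (the seed and the rnorm() values here are placeholders, so the numbers will not match the printed results exactly):
set.seed(16)
toy1 = data.frame(indiv = rep(1:3, times = 2),
                  trt = factor( rep(c("a", "b"), each = 3) ),
                  time1 = rnorm(6),
                  time2 = rnorm(6),
                  time3 = rnorm(6))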
This dataset toy1 is in a wide format. It has 6 rows, and the quantitative values are stored in the 3 “time”
columns for a total of 18 values.
If we were going to analyze this dataset in R we would most likely need it to be in a long format. We
want to keep the two columns containing the identifying information (trt, indiv), have a single column
containing the information about the time of measurement (time1, time2, or time3), and a single column
containing the values of the quantitative measurement. To lengthen a dataset from wide to long we use the
pivot_longer() function.
Wide to long with pivot_longer()
The tidyr package was built to be used with pipes, and the dataset is the first argument for its functions.
In pivot_longer(), the first thing we do after defining the dataset we want to reshape is to list the columns
that contain values we want to be combined into a single column in cols. We can use the select_helpers
we learned earlier for this.
Once we pick the columns we are combining, we name the new “grouping” column that will contain the
names of the columns we are combining with names_to. I will name this new column time. Note that when
naming a column we need to use a string, meaning the name has to be in quotes.
Finally we need to name the new column of values using values_to. I’ll name this column measurement.
This is also done using a string.
We have the same amount of information in the newly long dataset below as we did in the wide dataset.
We have 18 values, now stored in a single column. We changed the shape of the dataset, not the underlying
data.
toy1 %>%
pivot_longer(cols = time1:time3,
names_to = "time",
values_to = "measurement")
# A tibble: 18 x 4
indiv trt time measurement
<int> <fct> <chr> <dbl>
1 1 a time1 1.17
2 1 a time2 -0.292
3 1 a time3 0.738
4 2 a time1 0.740
5 2 a time2 0.823
6 2 a time3 0.366
7 3 a time1 0.0754
8 3 a time2 -2.09
9 3 a time3 0.0477
10 1 b time1 -0.407
11 1 b time2 -1.83
12 1 b time3 0.992
13 2 b time1 0.528
14 2 b time2 0.102
15 2 b time3 -0.695
16 3 b time1 0.621
17 3 b time2 -0.727
18 3 b time3 -0.523
We’d better name this newly long-format object so we can use it in further examples. We’ll use this long
dataset to practice putting it back into wide format. This time I use starts_with() to choose the columns.
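A version of that naming step, using starts_with() to choose the time columns:
toy1long = toy1 %>%
    pivot_longer(cols = starts_with("time"),
                 names_to = "time",
                 values_to = "measurement")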
Long to wide with pivot_wider()
Now we can use the function pivot_wider() to widen the long dataset toy1long back to its original format.
You might want to do this if, for example, you were going to take a dataset from an analysis done in R to
graph in a program like SigmaPlot (which apparently often works best on wide datasets).
In the pivot_wider() function, we’ll use the pair of arguments names_from and values_from after defining
the dataset we want to reshape.
The names_from argument is where we list the column(s) that contains the values we will use as the new
column names. We are referring to an existing column, so this can be done with bare names (i.e., without
quotes around the variable names).
We list the column that contains the value(s) we will fill the new columns with using values_from.
toy1long %>%
pivot_wider(names_from = time,
values_from = measurement)
# A tibble: 6 x 5
indiv trt time1 time2 time3
<int> <fct> <dbl> <dbl> <dbl>
1 1 a 1.17 -0.292 0.738
2 2 a 0.740 0.823 0.366
3 3 a 0.0754 -2.09 0.0477
4 1 b -0.407 -1.83 0.992
5 2 b 0.528 0.102 -0.695
6 3 b 0.621 -0.727 -0.523
In some cases we’ll want to make a wide dataset with new column names based on multiple variables in the
long dataset. In that case we can pass multiple variable names to names_from.
By default, the new column names will have an underscore (_) in them separating the information from the
two variables. The new column names are based on the order the variables are listed in names_from.
Now we have a 3 row dataset with quantitative values stored in 6 columns: we still have our original 18
pieces of information.
toy1long %>%
pivot_wider(names_from = c(trt, time),
values_from = measurement)
# A tibble: 3 x 7
indiv a_time1 a_time2 a_time3 b_time1 b_time2 b_time3
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1.17 -0.292 0.738 -0.407 -1.83 0.992
2 2 0.740 0.823 0.366 0.528 0.102 -0.695
3 3 0.0754 -2.09 0.0477 0.621 -0.727 -0.523
We can change the symbol used in the new column names with names_sep. Here I also change the new column names by changing the order in which the variables are listed in names_from.
toy1long %>%
pivot_wider(names_from = c(time, trt),
values_from = measurement,
names_sep = ".")
# A tibble: 3 x 7
indiv time1.a time2.a time3.a time1.b time2.b time3.b
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1.17 -0.292 0.738 -0.407 -1.83 0.992
2 2 0.740 0.823 0.366 0.528 0.102 -0.695
3 3 0.0754 -2.09 0.0477 0.621 -0.727 -0.523
If the rows of the long dataset aren’t uniquely identified when converting into a wide format you will get a
warning message from pivot_wider().
For example, if we were trying to spread toy1long but we only had the trt variable and not the indiv
variable our rows wouldn’t be uniquely identified. It is only the combination of trt, indiv, and time that
uniquely identifies a row.
Let’s remove indiv from the dataset using select().
toy1long %>%
select(-indiv)
# A tibble: 18 x 3
trt time measurement
<fct> <chr> <dbl>
1 a time1 1.17
2 a time2 -0.292
3 a time3 0.738
4 a time1 0.740
5 a time2 0.823
6 a time3 0.366
7 a time1 0.0754
8 a time2 -2.09
9 a time3 0.0477
10 b time1 -0.407
11 b time2 -1.83
12 b time3 0.992
13 b time1 0.528
14 b time2 0.102
15 b time3 -0.695
16 b time1 0.621
17 b time2 -0.727
18 b time3 -0.523
There are now multiple observations of each time for each trt category; our rows are not uniquely identified.
Let’s see what happens when we use pivot_wider() on this dataset without changing the code.
In particular, take a look at the warning messages. These messages contain useful information about what
is in the output and why. The output dataset looks pretty different than what we’ve seen before because all
3 values for each trt and time were kept but placed into lists.
toy1long %>%
select(-indiv) %>%
pivot_wider(names_from = time,
values_from = measurement)
Warning: Values in `measurement` are not uniquely identified; output will contain list-cols.
* Use `values_fn = list(measurement = list)` to suppress this warning.
* Use `values_fn = list(measurement = length)` to identify where the duplicates arise
* Use `values_fn = list(measurement = summary_fun)` to summarise duplicates
# A tibble: 2 x 4
trt time1 time2 time3
<fct> <list> <list> <list>
1 a <dbl [3]> <dbl [3]> <dbl [3]>
2 b <dbl [3]> <dbl [3]> <dbl [3]>
If we really want to widen the dataset without indiv, we most likely want to summarize over the values for each trt and time. This can be done using the values_fn argument, as suggested in the warning message above.
toy1long %>%
select(-indiv) %>%
pivot_wider(names_from = time,
values_from = measurement,
values_fn = list(measurement = mean) )
# A tibble: 2 x 4
trt time1 time2 time3
<fct> <dbl> <dbl> <dbl>
1 a 0.661 -0.520 0.384
2 b 0.247 -0.817 -0.0754
Practice reshaping
Before we move on to Part 3 of the workshop I want you to take time to practice reshaping with pivot_wider() and pivot_longer().
We will once again be working with the babynames dataset.
Practice problem 3
The third practice problem is based on our work from practice problem 2, where we calculated the total number of baby names in the year we were born and in 2017 for each sex.
I didn’t name the final object, but I need to in order to use it in this problem. I’ll do that here, and print
the result so I remember what it looked like.
( numbaby_76_17 = babynames %>%
filter( year %in% c(1976, 2017) ) %>%
group_by(year, sex) %>%
summarise(n = n() ) %>%
ungroup() )
# A tibble: 4 x 3
year sex n
<dbl> <chr> <int>
1 1976 F 10900
2 1976 M 6491
3 2017 F 18309
4 2017 M 14160
Reshape the dataset to a wide format. Make a dataset with a separate column for
each sex containing the number of baby names in a given year.
Now reshape the same dataset to a different wide format. Make a dataset with a
separate column for each year containing the number of baby names in a given sex.
Take the dataset that has sex as separate columns and put this back in the original
format.
set.seed(16) # If I set the seed, we will all get the same random numbers
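If you want to follow along, you can build two small datasets with the structure we'll use: tojoin1 has plot-level counts and is missing site 3 treatment "c", and tojoin2 has plot elevations and is missing site 3 treatment "a". This is a minimal sketch, so the count and elev values below are illustrative random draws and won't match the printed output exactly.
# A sketch of the two joining datasets; the structure (which
# site-treatment combinations are present) matters more than the
# actual count and elev values
( tojoin1 = data.frame(site = rep(1:3, each = 3),
                       treat = rep( c("a", "b", "c"), times = 3),
                       count = rpois(9, 5) ) %>%
     filter( !(site == 3 & treat == "c") ) )
( tojoin2 = data.frame(site = rep(1:3, each = 3),
                       treat = rep( c("a", "b", "c"), times = 3),
                       elev = rnorm(9, mean = 1000, sd = 25) ) %>%
     filter( !(site == 3 & treat == "a") ) )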
The unique identifier of each measurement in each dataset is a combination of site and treat; those are
the variables that we will use to tell R which rows within the two datasets to combine into one.
Let’s start our joining practice by joining these two datasets together using inner_join().
See the help page, ?join, to see a description of each type of join available in dplyr. In the documentation,
you will see that every join involves two datasets, called x and y, to be joined. The x dataset is the first
dataset you give to the join function and the y dataset is the second.
An inner join matches on the unique identifiers and returns only rows that are shared in both datasets.
From the documentation, an inner_join() will
return all rows from x where there are matching values in y, and all columns from x and y
By default, inner_join() joins on all columns shared by the two datasets. When we use this default, we
will get a message telling us which variables were used for joining when we run the code.
We’ll name our new combined dataset joined, and print the result.
To make our code more explicit and easily understandable, we can also use the by argument to define which
variables we want to join on. This is what I usually do.
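A sketch of what both versions could look like, using the tojoin1 and tojoin2 objects sketched above:
# Default: inner_join() uses every column the two datasets share,
# and prints a message saying which variables it joined by
( joined = inner_join(tojoin1, tojoin2) )
# The same join, spelling out the joining variables with `by`
( joined = inner_join(tojoin1, tojoin2, by = c("site", "treat") ) )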
We see above that the joined dataset only has 7 rows. This is because there are only 7 site-treatment
combinations that are present in both datasets. If we want to retain more rows, we’ll need a different kind
of join.
A left join is used when we want to keep all rows in the first dataset regardless of whether they have a match in the second dataset.
From the documentation, left_join() will
return all rows from x, and all columns from x and y. Rows in x with no match in y will have
NA values in the new columns.
So in our scenario, we should get 8 rows back because we have 8 rows in the first dataset (tojoin1). We will still be missing a row for site 3 treatment “c”, since that combination is only in the second dataset (tojoin2) and a left join keeps only the rows of x.
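A sketch of the left join, with tojoin1 as the x dataset:
# Keep all 8 rows of tojoin1; the row with no match in tojoin2
# (site 3 treatment "a") gets NA for elev
left_join(tojoin1, tojoin2, by = c("site", "treat") )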
site treat count elev
1 1 a 7 1003.0419
2 1 b 4 1036.1179
3 1 c 6 977.3154
4 2 a 4 976.0533
5 2 b 9 984.7461
6 2 c 5 1017.7019
7 3 a 3 NA
8 3 b 8 1012.7026
There is also a right_join(), which we won’t practice today but works a lot like the left_join().
To keep all rows in both datasets regardless of a match, we can make a full join via full_join().
The full_join() function will
return all rows and all columns from both x and y. Where there are not matching values, returns
NA for the one missing.
This is how we can get rows for all nine site-treatment combinations.
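A sketch of the full join:
# Keep all 9 site-treatment combinations; count is NA for site 3
# treatment "c" and elev is NA for site 3 treatment "a"
full_join(tojoin1, tojoin2, by = c("site", "treat") )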
There is an additional sentence in the documentation when describing the joins that we haven’t discussed
yet.
If there are multiple matches between x and y, all combination of the matches are returned.
This is an important topic to cover, as sometimes we want this behavior but other times this behavior helps
us uncover a mistake we are making.
For example, if we wanted to join a variable that was only measured at the “site” level, this behavior is
desirable. Let’s make a dataset that has a variable measured at the site level.
# A site level variable, the amount of rainfall
( tojoin3 = data.frame(site = 1:3,
rainfall = rgamma(3, 10, 1) ) )
site rainfall
1 1 16.43203
2 2 9.30625
3 3 11.35799
This new dataset only has 3 rows. Every treatment plot within a site in the count dataset tojoin1 needs to be assigned the same value of the site-level rainfall variable. This means each row in the site-level dataset will be
matched to multiple rows in tojoin1 when joined, which is exactly what we want to happen. We end up
with an 8 row dataset.
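A sketch of this join; I use left_join() here to keep every row of tojoin1, although any of the mutating joins would return the same 8 rows since all three sites appear in both datasets.
# Join the site-level rainfall onto the plot-level counts; each
# rainfall value is repeated for every plot at that site
left_join(tojoin1, tojoin3, by = "site")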
This sort of behavior can cause unexpected results, though. If we join the original two joining datasets using
only site instead of the two variables that make up the unique identifier of each row, we will end up with
multiple matches per row. This leaves us with a dataset that is much longer than expected.
When this sort of thing happens unexpectedly, we likely need to step back and evaluate whether or not we
have unique identifiers. We may need to rethink what we are doing versus what we want the final dataset
to look like.
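For example, a full join on site alone gives the 22 row dataset below (any of the mutating joins returns the same 22 rows here); the as_tibble() at the end is my addition, only there so the long result prints compactly.
# Joining on site only: every row of tojoin1 is matched to every
# tojoin2 row at the same site, so treat shows up twice (treat.x, treat.y)
full_join(tojoin1, tojoin2, by = "site") %>%
     as_tibble()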
# A tibble: 22 x 5
site treat.x count treat.y elev
<int> <fct> <int> <fct> <dbl>
1 1 a 7 b 1036.
2 1 a 7 c 977.
3 1 a 7 a 1003.
4 1 b 4 b 1036.
5 1 b 4 c 977.
6 1 b 4 a 1003.
7 1 c 6 b 1036.
8 1 c 6 c 977.
9 1 c 6 a 1003.
10 2 a 4 b 985.
# ... with 12 more rows
Using anti_join() to find missing data
The very last function we’ll learn today is yet another kind of join, called the anti_join().
An anti_join() will
return all rows from x where there are not matching values in y, keeping just columns from x.
This is great for figuring out which rows are missing matches between two datasets. In an anti-join, we want
to only return the values in the x dataset that are not in the y dataset.
Both anti_join() and the related semi_join() act more like filters than joins.
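We won't practice semi_join() today, but for reference it keeps the rows of x that do have a match in y, without adding any columns from y. A quick sketch:
# The 7 rows of tojoin1 with a site-treatment match in tojoin2,
# keeping only the columns of tojoin1
semi_join(tojoin1, tojoin2, by = c("site", "treat") )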
Here’s what this looks like, pulling out the row in tojoin1 that is missing from tojoin2. We see we are
missing treatment “a” at site 3 from tojoin2.
tojoin1 %>%
     anti_join( tojoin2, by = c("site", "treat") )
If we want to find the row in tojoin2 that is missing from tojoin1, we switch the order we put the datasets in anti_join(). Now we see we are missing treatment “c” at site 3 from tojoin1.
We can pipe the dataset as the y dataset, as well, using the . placeholder we saw earlier.
tojoin1 %>%
anti_join( tojoin2, ., by = c("site", "treat") )
Joins are an important skill to learn for data manipulation. The main take-home message here is that joins
can be used as part of a longer chain of data manipulation steps via the pipe.
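As a sketch of that idea, here is one way to chain the example joins together with other dplyr steps: join the elevation and rainfall data onto the counts and then summarize by site. The summary variables are just for illustration.
tojoin1 %>%
     left_join(tojoin2, by = c("site", "treat") ) %>%
     left_join(tojoin3, by = "site") %>%
     group_by(site) %>%
     summarise(total_count = sum(count),
               mean_elev = mean(elev, na.rm = TRUE),
               rainfall = mean(rainfall) ) %>%
     ungroup()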
Two additional dplyr functions
There are a couple of other functions I use a lot for data checking, which I've put here at the end. We may not get to these during the workshop, so I have listed them here for reference.
The n_distinct() function is useful for counting up the number of unique values of a variable. I use it most when I'm learning about a dataset that I don't know well and want to understand the structure of individual variables.
I also use n_distinct() when I think I have mistakes in a variable, such as a value of a categorical variable
being misentered. For example, if we know our dataset should only have 3 values for cyl we can check to
make sure our variable doesn’t contain more than that with n_distinct().
mtcars %>%
summarise( ncyl = n_distinct(cyl) )
ncyl
1 3
Another example is checking how many unique values of one variable there are in each group. Here we'll calculate
how many unique values of mpg there are in each cylinder category with n_distinct() and compare that
to the number of rows we have in that category calculated with n().
mtcars %>%
group_by(cyl) %>%
summarise( nmpg = n_distinct(mpg),
n = n() )
# A tibble: 3 x 3
cyl nmpg n
<dbl> <int> <int>
1 4 9 11
2 6 6 7
3 8 12 14
There are fewer unique mpg values (only 27) than there are rows in the dataset.
The last of the dplyr functions we will see is the distinct() function. This is the function we can use if
we need to remove duplicate-valued rows from our dataset.
For example, we saw above that we had fewer unique values of mpg in each cyl group than we had rows in
the dataset. Let’s pull out only the distinct values of mpg per cyl group.
The resulting dataset has 27 rows instead of the 32 rows of the original dataset. These are the rows that
contain the first of each unique value of mpg within each cylinder category.
mtcars %>%
group_by(cyl) %>%
distinct(mpg) %>%
ungroup()
# A tibble: 27 x 2
mpg cyl
<dbl> <dbl>
1 21 6
2 22.8 4
3 21.4 6
4 18.7 8
5 18.1 6
6 14.3 8
7 24.4 4
8 19.2 6
9 17.8 6
10 16.4 8
# ... with 17 more rows
Above we only kept the grouping variables and the variable we used to determine uniqueness. If we want to
keep all the variables when using distinct() we need the .keep_all argument.
mtcars %>%
group_by(cyl) %>%
distinct(mpg, .keep_all = TRUE) %>%
ungroup()
# A tibble: 27 x 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
3 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
4 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
5 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
6 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
7 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
8 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
9 17.8 6 168. 123 3.92 3.44 18.9 1 0 4 4
10 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 3
# ... with 17 more rows
Answers problem 1
Which name was given to the largest number of babies in the year you were born?
I was born in 1976, so I first filter the dataset to that year. Since I wanted to find the name given to the
largest number of babies I then sorted by n in descending order.
The most babies were named Michael in 1976.
babynames %>%
filter(year == 1976) %>%
arrange(-n)
# A tibble: 17,391 x 5
year sex name n prop
<dbl> <chr> <chr> <int> <dbl>
1 1976 M Michael 66964 0.0410
2 1976 F Jennifer 59474 0.0378
3 1976 M Jason 52681 0.0323
4 1976 M Christopher 45213 0.0277
5 1976 M David 39299 0.0241
6 1976 M James 38306 0.0235
7 1976 M John 34008 0.0208
8 1976 M Robert 33809 0.0207
9 1976 F Amy 31341 0.0199
10 1976 M Brian 30535 0.0187
# ... with 17,381 more rows
I need to filter the dataset to 2017 and babies named Michael. Notice I didn’t specify sex, and there were
both male and female babies named Michael in 2017.
babynames %>%
filter(year == 2017, name == "Michael")
# A tibble: 2 x 5
year sex name n prop
<dbl> <chr> <chr> <int> <dbl>
1 2017 F Michael 33 0.0000176
2 2017 M Michael 12579 0.00641
Go to practice problem 2
Answer problem 2
Calculate the total number of baby names for each level of the sex variable in the
year you were born and in 2017.
We didn’t see how to filter to multiple values, so we’ll need to make use of the hint to do so.
I’ll first filter the dataset to only the years 1976 and 2017. Then I’ll group it by both year and sex so I can
add up the total number of rows in each year for each sex with summarise() and n(). This works because
each row in the babynames dataset is a unique name.
I end with ungroup() to make sure the final result isn’t grouped anymore.
There are more unique baby names in 2017 than in 1976, and in both years there were more unique names for female babies than for male babies.
babynames %>%
filter( year %in% c(1976, 2017) ) %>%
group_by(year, sex) %>%
summarise(n = n() ) %>%
ungroup()
# A tibble: 4 x 3
year sex n
<dbl> <chr> <int>
1 1976 F 10900
2 1976 M 6491
3 2017 F 18309
4 2017 M 14160
Answers problem 3
These questions are based on the final dataset from practice problem 2. My first step was to name this
object so I can use it to answer the question in problem 3.
Reshape the dataset from practice problem 2 to a wide format. Make a dataset with
a separate column for each sex containing the number of baby names in a given year.
Since we’re going from long to wide we’ll need pivot_wider(). The question specifically asks for separate
columns by sex, which tells me that this variable should be listed in names_from. The n variable is what I
need to fill the columns with so I use it as the values_from variable.
numbaby_76_17 %>%
pivot_wider(names_from = sex,
values_from = n)
# A tibble: 2 x 3
year F M
<dbl> <int> <int>
1 1976 10900 6491
2 2017 18309 14160
Now reshape the same dataset to a different wide format. Make a dataset with a
separate column for each year containing the number of baby names in a given sex.
This is very similar to the first question, except this time year is the names_from variable. Notice that the result has backticks around the new column names, since column names that start with a number are not syntactically valid in R.
numbaby_76_17 %>%
pivot_wider(names_from = year,
values_from = n)
# A tibble: 2 x 3
sex `1976` `2017`
<chr> <int> <int>
1 F 10900 18309
2 M 6491 14160
Take the dataset that has sex as separate columns and put this back in the original
format.
Since we are now going from “wide” to “long”, this involves using pivot_longer(). The two columns that
contain information I want to gather are F and M. I’ll call the new categorical column "sex" and the new
continuous column "num_name".
numbaby_76_17 %>%
pivot_wider(names_from = sex,
values_from = n) %>%
pivot_longer(cols = F:M,
names_to = "sex",
values_to = "num_name")
# A tibble: 4 x 3
year sex num_name
<dbl> <chr> <int>
1 1976 F 10900
2 1976 M 6491
3 2017 F 18309
4 2017 M 14160