Data Manipulation Workshop Handout
Ariel Muldoon
March 2020
Contents
Introduction and background 3
Where to find help . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Getting started . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Check package version . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Load packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
The mtcars dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Using mutate() with grouped datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Sorting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
Combining data manipulation tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Using temporary objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Nesting functions to avoid temporary objects . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
The pipe operator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
Combining data manipulation tasks using the pipe operator . . . . . . . . . . . . . . . 25
Using the pipe operator with non-dplyr functions . . . . . . . . . . . . . . . . . . . . 26
Counting the number of rows in a group . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
Practice data manipulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
The babynames dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
Practice problem 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Practice problem 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
Working through the practice problems 43
Answers problem 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Answer problem 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Answers problem 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Introduction and background
Today we are going to be learning how to perform basic data manipulation tasks in R. While there are many
options for tackling data manipulation problems in R (e.g., apply family, data.table package, functions
ave() and aggregate()), we will be working with the dplyr and tidyr packages today. I find that these
packages are approachable for people without a lot of programming background but are still quite fast when
working with large datasets.
In this workshop, we will cover the following:
In Part 1, we’ll review functions from dplyr for basic data manipulation/munging/cleaning. We
end with a chance for you to practice some of the functions we covered.
In Part 2, you’ll be introduced to the concept of reshaping datasets via tidyr functions. We’ll
do another practice exercise at the end of this section.
In Part 3 we’ll practice joining datasets using the join functions from dplyr.
Where to find help
It is important to know where to go for help when you run into data manipulation problems. The first
place to start is the help pages for the functions themselves; too often folks skip this step and end up in a
time-consuming search that could have been avoided. Another place that I often go to find help is on the
Stack Overflow website: http://stackoverflow.com/questions/tagged/r. I’ve given you the link to questions
that are specifically R programming questions. You could also look for questions tagged with dplyr or tidyr
or search all R-related questions using keywords or phrases.
The newer RStudio Community website, https://community.rstudio.com/, is another place to look for and
ask for help that can be less intimidating than Stack Overflow.
Both of these packages are fairly young, and while they are stabilizing, some elements of the packages may
still change. Functions we are using today, however, are functions that are already stable and likely won’t
change much through time.
Both packages have introductory vignettes that are useful.
The Introduction to dplyr vignette is updated as dplyr is updated, and is a nice resource: https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html.
Also see the Tidy data vignette for some examples using tidyr: https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html.
The RStudio cheat sheets may also be helpful: https://www.rstudio.com/resources/cheatsheets/
Getting started
Check package version
The current version of dplyr is 0.8.3 and the current version of tidyr is 1.0.2.
You can use packageVersion() to check for the currently installed version of a package. Make sure you are using current versions of both packages.
packageVersion("dplyr")
[1] '0.8.3'
packageVersion("tidyr")
[1] '1.0.2'
If one of these packages isn’t up to date, you need to re-install it. You can install a package with code, e.g., install.packages("tidyr"), or via the RStudio Packages pane Install button. Remember that you do not need to install a package every time you use it, so don’t make this code part of a script.
In between version releases, bugs are fixed and new issues addressed in the development version of a pack-
age. For these two packages, you can see the changes, check for known issues, and download the current
development version via their Github repositories. For dplyr see https://github.com/tidyverse/dplyr and
for tidyr see https://github.com/tidyverse/tidyr.
Load packages
If all packages are up-to-date we can load dplyr and tidyr and get started.
library(dplyr)
library(tidyr)
The mtcars dataset
In the first part of the workshop we will be using the mtcars dataset to practice data manipulation. This
dataset comes with R, and information about this dataset is available in the R help files for the dataset
(?mtcars).
We will be using both categorical and continuous variables from this dataset, including:
mpg (Miles per US gallon),
wt (car weight in 1000 lbs),
cyl (number of cylinders),
am (type of transmission),
disp (engine displacement),
qsec (quarter mile time), and
hp (horsepower).
Let’s take a quick look at the first six lines (with head()) and structure (with str()) of this dataset. You
should recognize that cyl and am (as well as others like vs) are categorical variables. However, they are
considered numeric variables in the dataset since the categories are expressed with numbers.
head(mtcars)
str(mtcars)
We’re going to start out today by learning how to calculate summary statistics by group. I start here because this is a common task that I see folks struggle with in R. The task of calculating summaries by groups in R is
referred to as a split-apply-combine task because we want to split the dataset into groups, apply a function
to each split, and then combine the results back into a single dataset.
There are a variety of ways to perform such tasks in R. We will be using dplyr functions in this workshop
but in the long run you may find you like the style of another method better.
With dplyr, the key to split-apply-combine tasks is grouping. We need to define which variable contains the
groups that we want to summarize separately. We create a grouped dataset using the group_by() function.
Let’s create a grouped dataset named bycyl, where we group mtcars by the cyl variable. The cyl variable
is a categorical variable representing the number of cylinders a car has. This variable has 3 different levels,
4, 6, and 8.
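Making this grouped dataset takes a single group_by() call, something like:
bycyl = group_by(mtcars, cyl)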
We can see that the new object is a grouped dataset if we print the head of the dataset and see the Groups
tag or see the class grouped_df in the object structure.
head(bycyl)
# A tibble: 6 x 11
# Groups: cyl [3]
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
str(bycyl)
Using the summarise() function
Now that we have a grouped dataset, we can use it with the summarise() function to calculate summary
statistics by group. Note that summarize() is an alternative spelling for the same function.
We’ll start by calculating the mean engine displacement for each cylinder category. We will be working on
the grouped dataset bycyl since we want summaries by groups.
Notice that the first argument of summarise() is the dataset we want summarized. This is true for most
of the dplyr functions. We list the summary function and variable we want summarized as the second
argument.
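The call that produces the summary below would look something like:
summarise( bycyl, mean(disp) )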
# A tibble: 3 x 2
cyl `mean(disp)`
<dbl> <dbl>
1 4 105.
2 6 183.
3 8 353.
Notice that we printed the summarized dataset but did not name the resulting object. This is what we
will be doing for most of the workshop, as my goal is to show you what happens to the dataset after we
manipulate it. You certainly can (and likely will want to) name your final datasets. We’ll see some examples
of naming the new objects once we are doing multiple data manipulation tasks at one time.
We can summarize multiple variables or use different summary functions at once in summarise() by using
commas to separate each new function/variable.
For example, we can calculate the mean of engine displacement and horsepower by cylinder category in the
same function call.
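One way to write this, with the two summaries separated by a comma:
summarise( bycyl, mean(disp), mean(hp) )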
# A tibble: 3 x 3
cyl `mean(disp)` `mean(hp)`
<dbl> <dbl> <dbl>
1 4 105. 82.6
2 6 183. 122.
3 8 353. 209.
The default names for the new variables we’ve been calculating are sufficient for a quick summary but are
not particularly convenient if we wanted to use the result for anything further in R. We can set variable
names as we summarize.
Let’s calculate the mean and standard deviation of engine displacement by cylinder category and name the
new variables mdisp and sdisp, respectively.
summarise( bycyl, mdisp = mean(disp), sdisp = sd(disp) )
# A tibble: 3 x 3
cyl mdisp sdisp
<dbl> <dbl> <dbl>
1 4 105. 26.9
2 6 183. 41.6
3 8 353. 67.8
Datasets can be grouped by multiple variables as well as by a single variable. This is common for studies with multiple factors of interest or with nested study designs (e.g., plots nested in transects nested in sites).
Let’s group mtcars by both cyl and am (transmission type) and then calculate the mean engine displacement.
In the output you can see we calculated mean engine displacement for every factor combination, for a total
of six rows (3 cyl categories and 2 am categories).
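A sketch of this two-variable grouping (the object name bycylam is only for illustration):
bycylam = group_by(mtcars, cyl, am)
summarise( bycylam, mdisp = mean(disp) )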
# A tibble: 6 x 3
# Groups: cyl [3]
cyl am mdisp
<dbl> <dbl> <dbl>
1 4 0 136.
2 4 1 93.6
3 6 0 205.
4 6 1 155
5 8 0 358.
6 8 1 326
Ungrouping a dataset
Looking at our last result, we can see the dataset is still grouped by the cyl variable (i.e., cyl is listed in
“Groups”). If we are finished with our data manipulation it is best practice to ungroup the dataset. Trying
to work with a dataset that is grouped when we don’t want it to be can lead to unusual behavior. It is
“safest” to make sure the final version of a dataset is ungrouped.
Ungrouping is done via the ungroup() function. Notice we no longer have any Groups listed in the output
once we do this, as the result is no longer grouped by any variables.
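A sketch of ungrouping that summary (again, the object names are only for illustration):
sum_cylam = summarise( bycylam, mdisp = mean(disp) )
ungroup(sum_cylam)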
# A tibble: 6 x 3
cyl am mdisp
<dbl> <dbl> <dbl>
1 4 0 136.
2 4 1 93.6
3 6 0 205.
4 6 1 155
5 8 0 358.
6 8 1 326
Summarizing multiple variables at once
When we want to summarize many variables in a dataset using the same function, we can use one of
the scoped variants of summarise(). The scoped variants are summarise_all(), summarise_at(), and
summarise_if().
Note: These scoped functions will still be available but will be superseded by across() in dplyr
1.0.0, which will be released in 2020.
summarise_all()
The summarise_all() function is useful when we want to summarize every non-grouping variable in the
dataset with the same function. We give the function we want to use for the summaries as the second
argument, .funs.
Let’s see how summarise_all() works by calculating the mean of every variable in mtcars for each cylinder
category.
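The call would look something like:
summarise_all( bycyl, .funs = mean )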
# A tibble: 3 x 11
cyl mpg disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 4 26.7 105. 82.6 4.07 2.29 19.1 0.909 0.727 4.09 1.55
2 6 19.7 183. 122. 3.59 3.12 18.0 0.571 0.429 3.86 3.43
3 8 15.1 353. 209. 3.23 4.00 16.8 0 0.143 3.29 3.5
Note that we need to be careful with summarise_all(). We could have problems if trying to summarize both
continuous and categorical variables in a single dataset and could end up with an error. All the variables
in mtcars are currently numeric. What would happen if we made one of the variables a factor and tried to
take the mean of every variable?
mtcars$vs = factor(mtcars$vs)
R still does the averaging, but returns NA and warning messages for the vs column.
# A tibble: 3 x 11
cyl mpg disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 4 26.7 105. 82.6 4.07 2.29 19.1 0.909 0.727 4.09 1.55
2 6 19.7 183. 122. 3.59 3.12 18.0 0.571 0.429 3.86 3.43
3 8 15.1 353. 209. 3.23 4.00 16.8 0 0.143 3.29 3.5
summarise_at()
We won’t always want to summarize every column in a dataset, for reasons including having a mix of variable
types. One option to only summarize some of the variables is to use summarise_at(), where we can list a
subset of the columns that we want summaries for by name in the .vars argument.
You can list the variables to summarize within vars().
summarise_at(bycyl, .vars = vars(disp, wt), .funs = mean)
# A tibble: 3 x 3
cyl disp wt
<dbl> <dbl> <dbl>
1 4 105. 2.29
2 6 183. 3.12
3 8 353. 4.00
We can also drop out the variables we don’t want summarized rather than writing out the ones we do want.
For example, while all the variables in mtcars are read as numeric, some are actually categorical. If we don’t
want to treat them as continuous, we can drop them from the summary. Let’s drop am and vs from our
summary. We can do this by using the minus sign with the variable names inside vars().
We will talk more about selecting and dropping specific variables later today when we talk about the
select() function.
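One way to write this drop-based summary:
summarise_at( bycyl, .vars = vars(-am, -vs), .funs = mean )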
# A tibble: 3 x 9
cyl mpg disp hp drat wt qsec gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 4 26.7 105. 82.6 4.07 2.29 19.1 4.09 1.55
2 6 19.7 183. 122. 3.59 3.12 18.0 3.86 3.43
3 8 15.1 353. 209. 3.23 4.00 16.8 3.29 3.5
summarise_if()
If we want to choose the columns we want to summarize using a logical predicate function, we can use
summarise_if(). You can see on the help page that the predicate function is the second argument,
.predicate, followed by the summary functions.
Here, we’ll only summarize the numeric variables by using the predicate function is.numeric(). Using this,
R checks if a column is numeric with is.numeric() and if the result is TRUE a summary of the column is
made. If the result is FALSE, the variable is dropped from the output.
In this example, all variables except vs are numeric and will be summarized.
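The call would look something like:
summarise_if( bycyl, .predicate = is.numeric, .funs = mean )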
# A tibble: 3 x 11
cyl mpg disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 4 26.7 105. 82.6 4.07 2.29 19.1 0.909 0.727 4.09 1.55
2 6 19.7 183. 122. 3.59 3.12 18.0 0.571 0.429 3.86 3.43
3 8 15.1 353. 209. 3.23 4.00 16.8 0 0.143 3.29 3.5
If we want to summarize many variables with multiple functions, we pass all the functions we want to the
.funs argument in a list(). The functions are listed with commas between them.
For example, maybe we want to calculate both the mean and the maximum for all numeric variables by
group. The functions we use are mean() and max().
While the only example we see today is using summarise_if(), this can be done in any of the summarise_*
functions.
summarise_if( bycyl,
.predicate = is.numeric,
.funs = list(mean, max) )
# A tibble: 3 x 21
cyl mpg_fn1 disp_fn1 hp_fn1 drat_fn1 wt_fn1 qsec_fn1 vs_fn1 am_fn1 gear_fn1 carb_fn1 mpg_fn2
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 4 26.7 105. 82.6 4.07 2.29 19.1 0.909 0.727 4.09 1.55 33.9
2 6 19.7 183. 122. 3.59 3.12 18.0 0.571 0.429 3.86 3.43 21.4
3 8 15.1 353. 209. 3.23 4.00 16.8 0 0.143 3.29 3.5 19.2
# ... with 9 more variables: disp_fn2 <dbl>, hp_fn2 <dbl>, drat_fn2 <dbl>, wt_fn2 <dbl>,
# qsec_fn2 <dbl>, vs_fn2 <dbl>, am_fn2 <dbl>, gear_fn2 <dbl>, carb_fn2 <dbl>
Notice that we get fn1 and fn2 appended to the variable name when using multiple functions. To control
what name is appended you can assign names to each function within the list().
summarise_if( bycyl,
.predicate = is.numeric,
.funs = list(mn = mean, mx = max) )
# A tibble: 3 x 21
cyl mpg_mn disp_mn hp_mn drat_mn wt_mn qsec_mn vs_mn am_mn gear_mn carb_mn mpg_mx disp_mx hp_mx
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 4 26.7 105. 82.6 4.07 2.29 19.1 0.909 0.727 4.09 1.55 33.9 147. 113
2 6 19.7 183. 122. 3.59 3.12 18.0 0.571 0.429 3.86 3.43 21.4 258 175
3 8 15.1 353. 209. 3.23 4.00 16.8 0 0.143 3.29 3.5 19.2 472 335
# ... with 7 more variables: drat_mx <dbl>, wt_mx <dbl>, qsec_mx <dbl>, vs_mx <dbl>, am_mx <dbl>,
# gear_mx <dbl>, carb_mx <dbl>
The dplyr package truncates how much of the dataset we see printed into the R Console. For very wide
datasets like the one we just created, we can get a better idea of what the result looks like using glimpse().
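For example, something like this (the object name sum_mnmx is only for illustration):
sum_mnmx = summarise_if( bycyl,
                         .predicate = is.numeric,
                         .funs = list(mn = mean, mx = max) )
glimpse(sum_mnmx)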
Observations: 3
Variables: 21
$ cyl <dbl> 4, 6, 8
$ mpg_mn <dbl> 26.66364, 19.74286, 15.10000
$ disp_mn <dbl> 105.1364, 183.3143, 353.1000
$ hp_mn <dbl> 82.63636, 122.28571, 209.21429
$ drat_mn <dbl> 4.070909, 3.585714, 3.229286
$ wt_mn <dbl> 2.285727, 3.117143, 3.999214
$ qsec_mn <dbl> 19.13727, 17.97714, 16.77214
$ vs_mn <dbl> 0.9090909, 0.5714286, 0.0000000
$ am_mn <dbl> 0.7272727, 0.4285714, 0.1428571
$ gear_mn <dbl> 4.090909, 3.857143, 3.285714
$ carb_mn <dbl> 1.545455, 3.428571, 3.500000
$ mpg_mx <dbl> 33.9, 21.4, 19.2
$ disp_mx <dbl> 146.7, 258.0, 472.0
$ hp_mx <dbl> 113, 175, 335
$ drat_mx <dbl> 4.93, 3.92, 4.22
$ wt_mx <dbl> 3.190, 3.460, 5.424
$ qsec_mx <dbl> 22.90, 20.22, 18.00
$ vs_mx <dbl> 1, 1, 0
$ am_mx <dbl> 1, 1, 1
$ gear_mx <dbl> 5, 5, 5
$ carb_mx <dbl> 2, 6, 8
Now we will cover functions for other common data manipulation tasks, starting with filtering. Filtering is
about how many rows we want in the dataset, not about the number of columns. It involves making specific
subsets of your data by removing unwanted rows. Rows to keep are chosen based on logical conditions.
For example, maybe we want to focus on a subset of the dataset that only involves cars with automatic
transmissions. We can do this with the filter() function to filter the mtcars dataset to only those rows
where am is 0.
Like other dplyr functions, the dataset is the first argument in filter(). The subsequent arguments are
the conditions that the filtered dataset should meet. Here, the condition is that cars must have automatic
transmissions, or am == 0 (note the two equals signs).
filter(mtcars, am == 0)
The filter() function will always be used with logical operators such as == (testing for equality), != (testing
for inequality), < (less than), is.na (all NA values), !is.na (all values except NA), >= (greater than or equal
to), etc.
If we wanted to filter out all cars that weigh more than 4000 lbs (i.e., wt values above 4, since wt is in units of 1000 lbs), we can keep only the rows where wt <= 4.
filter(mtcars, wt <= 4)
# A tibble: 28 x 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# ... with 18 more rows
Alternatively, we could achieve the same thing by choosing everything that is not greater than 4, !wt > 4.
The exclamation point, !, is the not operator.
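That version of the call looks like:
filter(mtcars, !wt > 4)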
# A tibble: 28 x 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# ... with 18 more rows
We can filter grouped datasets, and the condition will be applied separately to each group. For example,
maybe we want to keep only the rows where wt is greater than its cylinder category group mean.
Notice I switch to filtering the grouped dataset bycyl here.
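One way to write this grouped filter:
filter( bycyl, wt > mean(wt) )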
# A tibble: 13 x 11
# Groups: cyl [3]
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
2 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
3 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
4 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
5 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
6 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
7 17.8 6 168. 123 3.92 3.44 18.9 1 0 4 4
8 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 3
9 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4
10 10.4 8 460 215 3 5.42 17.8 0 0 3 4
11 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4
12 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1
13 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2
And, of course, we can filter datasets by multiple conditions at once. If we wanted to filter the dataset to
only cars with automatic transmission (am == 0) and that have weights less than or equal to 4000 lbs (wt
<= 4), we can include both conditions in filter() separated by a comma.
filter(mtcars, am == 0, wt <= 4)
While we won’t see it today, if you need a logical OR statement you will need the | symbol, found on the
backslash key on your keyboard.
The dplyr package has filter_all(), filter_at(), and filter_if() verbs available. These would be
useful if we wanted to apply the same filter to many columns of data.
These are often used in combination with the functions any_vars() or all_vars(). The “Examples” section
of the help page is a good place to start to see worked examples.
Selecting variables with select()
Keeping only a subset of the columns of a dataset is referred to as selecting variables. This might be for
organizational reasons, where an analysis is focused on only some of many variables and so we want to create
a dataset that contains only the variables of interest. Selecting is about how many columns we want to keep,
not how many rows we have.
The dplyr function select() makes selecting columns very easy to do. We can keep or drop variables by
name (although you can also use the index number) with straightforward code.
Let’s select only the cyl variable from mtcars (printing just the first rows to save space in this document).
select(mtcars, cyl)
# A tibble: 32 x 1
cyl
<dbl>
1 6
2 6
3 4
4 6
5 8
6 6
7 8
8 4
9 4
10 6
# ... with 22 more rows
If we want to keep all variables between (and including) cyl and vs, we indicate that with the colon, :.
select(mtcars, cyl:vs)
# A tibble: 32 x 7
cyl disp hp drat wt qsec vs
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
1 6 160 110 3.9 2.62 16.5 0
2 6 160 110 3.9 2.88 17.0 0
3 4 108 93 3.85 2.32 18.6 1
4 6 258 110 3.08 3.22 19.4 1
5 8 360 175 3.15 3.44 17.0 0
6 6 225 105 2.76 3.46 20.2 1
7 8 360 245 3.21 3.57 15.8 0
8 4 147. 62 3.69 3.19 20 1
9 4 141. 95 3.92 3.15 22.9 1
10 6 168. 123 3.92 3.44 18.3 1
# ... with 22 more rows
If we want to keep only a few columns, we can separate the desired column names with a comma. Here we
select only cyl and vs.
select(mtcars, cyl, vs)
# A tibble: 32 x 2
cyl vs
<dbl> <fct>
1 6 0
2 6 0
3 4 1
4 6 1
5 8 0
6 6 1
7 8 0
8 4 1
9 4 1
10 6 1
# ... with 22 more rows
The select() function has several special functions to make variable selection even easier. See the help
page for select_helpers for a list of all of these (?select_helpers).
These special functions include starts_with(), contains(), and ends_with(), among others. Such functions can be very useful if you have coded your variable names so that related variables contain the same letters or numbers.
We are going to start with an example using starts_with(), where we select all variables with names that start with a lowercase d. Remember that R is case sensitive, so an uppercase D is different than a lowercase d.
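One way to write this:
select( mtcars, starts_with("d") )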
# A tibble: 32 x 2
disp drat
<dbl> <dbl>
1 160 3.9
2 160 3.9
3 108 3.85
4 258 3.08
5 360 3.15
6 225 2.76
7 360 3.21
8 147. 3.69
9 141. 3.92
10 168. 3.92
# ... with 22 more rows
Or we could keep all variables that contain a lowercase a anywhere in the variable name.
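For example:
select( mtcars, contains("a") )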
# A tibble: 32 x 4
drat am gear carb
<dbl> <dbl> <dbl> <dbl>
1 3.9 1 4 4
2 3.9 1 4 4
3 3.85 1 4 1
4 3.08 0 3 1
5 3.15 0 3 2
6 2.76 0 3 1
7 3.21 0 3 4
8 3.69 0 4 2
9 3.92 0 4 2
10 3.92 0 4 4
# ... with 22 more rows
We’ve been choosing which variables we want to keep, but we could also choose which variables we want to
drop like we did with summarise_at() earlier. We drop variables using the minus sign (-).
Drop the gear variable.
select(mtcars, -gear)
# A tibble: 32 x 10
mpg cyl disp hp drat wt qsec vs am carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4
3 22.8 4 108 93 3.85 2.32 18.6 1 1 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0 1
5 18.7 8 360 175 3.15 3.44 17.0 0 0 2
6 18.1 6 225 105 2.76 3.46 20.2 1 0 1
7 14.3 8 360 245 3.21 3.57 15.8 0 0 4
8 24.4 4 147. 62 3.69 3.19 20 1 0 2
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 2
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4
# ... with 22 more rows
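We can also drop more than one variable by giving each its own minus sign. Something like this reproduces the next result, which has both gear and carb removed:
select(mtcars, -gear, -carb)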
# A tibble: 32 x 9
mpg cyl disp hp drat wt qsec vs am
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1
2 21 6 160 110 3.9 2.88 17.0 0 1
3 22.8 4 108 93 3.85 2.32 18.6 1 1
4 21.4 6 258 110 3.08 3.22 19.4 1 0
5 18.7 8 360 175 3.15 3.44 17.0 0 0
6 18.1 6 225 105 2.76 3.46 20.2 1 0
7 14.3 8 360 245 3.21 3.57 15.8 0 0
8 24.4 4 147. 62 3.69 3.19 20 1 0
9 22.8 4 141. 95 3.92 3.15 22.9 1 0
10 19.2 6 168. 123 3.92 3.44 18.3 1 0
# ... with 22 more rows
Drop all variables between and including am and carb. Notice that parentheses are needed around the
variables to use - like this.
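One way to write this:
select( mtcars, -(am:carb) )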
# A tibble: 32 x 8
mpg cyl disp hp drat wt qsec vs
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
1 21 6 160 110 3.9 2.62 16.5 0
2 21 6 160 110 3.9 2.88 17.0 0
3 22.8 4 108 93 3.85 2.32 18.6 1
4 21.4 6 258 110 3.08 3.22 19.4 1
5 18.7 8 360 175 3.15 3.44 17.0 0
6 18.1 6 225 105 2.76 3.46 20.2 1
7 14.3 8 360 245 3.21 3.57 15.8 0
8 24.4 4 147. 62 3.69 3.19 20 1
9 22.8 4 141. 95 3.92 3.15 22.9 1
10 19.2 6 168. 123 3.92 3.44 18.3 1
# ... with 22 more rows
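The code for the next result isn’t shown, but dropping the adjacent drat and wt columns reproduces it; for example:
select( mtcars, -(drat:wt) )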
# A tibble: 32 x 9
mpg cyl disp hp qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl>
1 21 6 160 110 16.5 0 1 4 4
2 21 6 160 110 17.0 0 1 4 4
3 22.8 4 108 93 18.6 1 1 4 1
4 21.4 6 258 110 19.4 1 0 3 1
5 18.7 8 360 175 17.0 0 0 3 2
6 18.1 6 225 105 20.2 1 0 3 1
7 14.3 8 360 245 15.8 0 0 3 4
8 24.4 4 147. 62 20 1 0 4 2
9 22.8 4 141. 95 22.9 1 0 4 2
10 19.2 6 168. 123 18.3 1 0 4 4
# ... with 22 more rows
The select_helpers can be used in other functions, as well. We would commonly use them in the scoped
*_at() functions like summarise_at() to help pick the variables to use within the function. The select()
function also has scoped variants available, select_all(), select_at(), and select_if().
The mutate() function adds new variables to a dataset, built from existing columns. As with the other dplyr functions we’ve seen, the dataset is the first argument. For example, let’s add a variable named disp.hp that is the sum of disp and hp.
mutate(mtcars, disp.hp = disp + hp)
# A tibble: 32 x 12
mpg cyl disp hp drat wt qsec vs am gear carb disp.hp
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 270
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 270
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 201
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 368
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 535
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 330
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 605
8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 209.
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 236.
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 291.
# ... with 22 more rows
We can make multiple new variables at once, separating each new variable by a comma like we did in
summarise(). A handy feature of mutate() is that we can work directly with the new variables we’ve made
within the same function call. For example, we can first calculate disp.hp and then calculate a second
variable that is half of disp.hp (disp.hp divided by 2). We can create other variables, as well, so we’ll
create the ratio of qsec and wt while we’re at it.
mutate(mtcars,
disp.hp = disp + hp,
halfdh = disp.hp/2,
qw = qsec/wt)
# A tibble: 32 x 14
mpg cyl disp hp drat wt qsec vs am gear carb disp.hp halfdh qw
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 270 135 6.28
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 270 135 5.92
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 201 100. 8.02
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 368 184 6.05
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 535 268. 4.95
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 330 165 5.84
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 605 302. 4.44
8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 209. 104. 6.27
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 236. 118. 7.27
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 291. 145. 5.32
# ... with 22 more rows
Using mutate() with grouped datasets
We can work with grouped datasets when using mutate(). This is useful when we want to add a column of a summary statistic for each group to the existing dataset rather than making a summary dataset.
Let’s create and add a new variable that is the mean horsepower for each cylinder category. Each car within
a cylinder category will have the same value of mean horsepower, as mutate() always returns a new dataset
that is the same length as the original.
Since this is a grouped operation we’ll work with the grouped dataset bycyl we made earlier.
mutate( bycyl, mhp = mean(hp) )
# A tibble: 32 x 12
# Groups: cyl [3]
mpg cyl disp hp drat wt qsec vs am gear carb mhp
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 122.
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 122.
3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 82.6
4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 122.
5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 209.
6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 122.
7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 209.
8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 82.6
9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 82.6
10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 122.
# ... with 22 more rows
As you can see, the code for mutate() resembles the code for summarise(). While we will not see examples
today, there are mutate_all()/mutate_at()/mutate_if() functions available that work much like the
scoped variants of the summarise() function we saw earlier today.
There is also a function called transmute(), which creates new variables that are the same length as the
current dataset like mutate() but only returns the new variables like summarise().
Sorting
There are some situations where you might want to sort your dataset by variables within the dataset. For
example, if we want to pull out the first observation in each group from a time series we might sort the
dataset first by time within group prior to filtering. We can sort datasets with dplyr using arrange().
Here we’ll start by sorting mtcars by cyl. By default we sort whatever variable we are sorting on from low
to high (ascending order).
arrange(mtcars, cyl)
# A tibble: 32 x 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl>
1 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
2 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
3 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
4 32.4 4 78.7 66 4.08 2.2 19.5 1 1 4 1
5 30.4 4 75.7 52 4.93 1.62 18.5 1 1 4 2
6 33.9 4 71.1 65 4.22 1.84 19.9 1 1 4 1
7 21.5 4 120. 97 3.7 2.46 20.0 1 0 3 1
8 27.3 4 79 66 4.08 1.94 18.9 1 1 4 1
9 26 4 120. 91 4.43 2.14 16.7 0 1 5 2
10 30.4 4 95.1 113 3.77 1.51 16.9 1 1 5 2
# ... with 22 more rows
To sort datasets by variables in descending order (highest to lowest), we can use the minus sign (-) or the
function desc() (which is from dplyr).
arrange(mtcars, -cyl)
# A tibble: 32 x 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl>
1 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
2 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
3 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 3
4 17.3 8 276. 180 3.07 3.73 17.6 0 0 3 3
5 15.2 8 276. 180 3.07 3.78 18 0 0 3 3
6 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4
7 10.4 8 460 215 3 5.42 17.8 0 0 3 4
8 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4
9 15.5 8 318 150 2.76 3.52 16.9 0 0 3 2
10 15.2 8 304 150 3.15 3.44 17.3 0 0 3 2
# ... with 22 more rows
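The desc() version of the same sort looks like:
arrange( mtcars, desc(cyl) )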
# A tibble: 32 x 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl>
1 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
2 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
3 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 3
4 17.3 8 276. 180 3.07 3.73 17.6 0 0 3 3
5 15.2 8 276. 180 3.07 3.78 18 0 0 3 3
6 10.4 8 472 205 2.93 5.25 18.0 0 0 3 4
7 10.4 8 460 215 3 5.42 17.8 0 0 3 4
8 14.7 8 440 230 3.23 5.34 17.4 0 0 3 4
9 15.5 8 318 150 2.76 3.52 16.9 0 0 3 2
10 15.2 8 304 150 3.15 3.44 17.3 0 0 3 2
# ... with 22 more rows
To sort variables only within groups, we sort by the grouping variable first and then the other sorting
variables. The arrange() function ignores group_by(); this is different than all the other dplyr verbs
we’ve learned today.
Here’s an example of within-group sorting, sorting each cylinder category from lowest to highest wt.
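One way to write this, sorting by cyl first and then wt within cyl:
arrange(mtcars, cyl, wt)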
10 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
11 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
12 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
13 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
14 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
15 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
16 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
17 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
18 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
19 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
20 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
21 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
22 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
23 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
24 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
25 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
26 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
27 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
28 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
29 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
30 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
31 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
32 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Combining data manipulation tasks
When working with our own datasets we’ll often want to do multiple data manipulation tasks in a row. Now that we’ve learned how to do different kinds of data manipulation, let’s string multiple manipulations together.
We are going to:
1. Filter the mtcars dataset to only those cars with automatic transmissions;
2. Create a new variable that is the ratio of engine displacement and horsepower;
3. Calculate the mean of this new variable separately for each cylinder category.
Using temporary objects
First we’ll do this one step at a time, creating a new named object for each step. As a reminder, we haven’t been naming objects as we practiced the functions above but instead were only printing results to the R Console. Now we’re actually naming each object. I use = for assignment but you can also use <-.
The extra pair of parentheses I’m using prints the object so we can see what happens at each step.
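A sketch of those four steps (the intermediate object names are only for illustration; the final object is the sum.ratio summary referred to below):
( autos = filter(mtcars, am == 0) )
( autos.ratio = mutate(autos, hd.ratio = hp/disp) )
( bycyl.auto = group_by(autos.ratio, cyl) )
( sum.ratio = summarise(bycyl.auto, mratio = mean(hd.ratio) ) )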
1 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
2 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
3 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
4 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
5 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
6 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
7 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
8 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
9 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
10 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
11 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
12 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
13 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
14 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
15 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
16 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
17 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
18 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
19 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
# A tibble: 3 x 2
cyl mratio
<dbl> <dbl>
1 4 0.635
2 6 0.590
3 8 0.554
The downside of this approach to multiple manipulations is that we had to make four objects when we really
just wanted the final sum.ratio object. We have to think of names for each object at each step and we end
up with a bunch of temporary objects in our R Environment.
Nesting functions to avoid temporary objects
An alternative to temporary objects is to nest all the functions together. This means we put one function call within the next function call. Nesting allows us to avoid making any temporary objects, but the resulting code is a bit hard to read. The code from nested functions is read inside out, where the first thing we do is also the most nested.
First, a simple example of nesting functions from work we did earlier, where we want to group the dataset by
cyl and am and then calculate the mean of disp. Here’s the same task via nesting. We put the group_by()
function call within summarise().
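That nested call looks like:
summarise( group_by(mtcars, cyl, am), mdisp = mean(disp) )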
# A tibble: 6 x 3
# Groups: cyl [3]
cyl am mdisp
<dbl> <dbl> <dbl>
1 4 0 136.
2 4 1 93.6
3 6 0 205.
4 6 1 155
5 8 0 358.
6 8 1 326
Now the more complicated example, where we combined the series of data manipulation tasks. Note how
the filter() is four functions deep in the code below.
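A sketch of the fully nested version:
summarise( group_by( mutate( filter(mtcars, am == 0),
                             hd.ratio = hp/disp ),
                     cyl ),
           mratio = mean(hd.ratio) )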
# A tibble: 3 x 2
cyl mratio
<dbl> <dbl>
1 4 0.635
2 6 0.590
3 8 0.554
The pipe operator
Now that we are combining multiple data manipulation functions from dplyr, it’s time to talk about the pipe operator. The pipe operator (%>%) represents a different coding style. The pipe allows us to perform a series
of data manipulation steps in a long chain while avoiding all those temporary objects or difficult-to-read
nested code.
In essence, the pipe operator pipes a dataset into a function as the first argument. One reason I’ve been
pointing out to you that the dplyr functions have the dataset as the first argument is that this is one of the
things that makes piping so easy with these functions.
You can think of the pipe as being pronounced “then”, which we’ll talk more about as we see some examples.
Using the pipe is a bit hard to picture when you are first introduced to it, but things should start to get
clearer once we see some code.
Let’s start with a simple example. Remember when we grouped mtcars by cyl earlier?
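That earlier code looked something like:
group_by(mtcars, cyl)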
We read even this simple code “inside out”. We see that we are grouping with group_by() and then if we
read inside the function we see the dataset we are going to group. Let’s write this same code using the pipe.
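With the pipe it becomes:
mtcars %>%
    group_by(cyl)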
The code with the pipe is read from left to right. We see we are working with the mtcars dataset and then
that we are grouping that dataset by cyl. The result is the same, but the code itself looks quite different.
Handily, we can keep piping through multiple functions in one long chain. Let’s group mtcars by cyl and
then calculate the mean disp of each group.
When working with pipes in a chain, it is standard to use a line break after each pipe with an indent for
each subsequent function.
Aside: Stylistically, including white space in your code improves code readability. Think of writing a sentence
without white space; it would be hard to read! Newer R users sometimes need to be reminded that white
space rationing is not in effect. :-D It might seem clunky at first, but including white space quickly becomes
natural and your code becomes much easier to read and understand.
mtcars %>%
group_by(cyl) %>%
summarise( mdisp = mean(disp) )
# A tibble: 3 x 2
cyl mdisp
<dbl> <dbl>
1 4 105.
2 6 183.
3 8 353.
Again, the above code is read from left to right. We see we are going to work with mtcars, then we group
it by cyl, and then we calculate the mean disp of the grouped dataset. When you read it like this you can
see why we might pronounce %>% as then.
Combining data manipulation tasks using the pipe operator
Now let’s return to the series of data manipulation tasks we combined earlier (filter, make a new variable, group, summarise) and write it as a single chain of pipes.
mtcars %>%
filter(am == 0) %>% # filter out the manual transmission cars
mutate(hd.ratio = hp/disp) %>% # make new ratio variable
group_by(cyl) %>% # group by number of cylinders
summarise(mratio = mean(hd.ratio) ) # calculate mean hd.ratio per cylinder category
# A tibble: 3 x 2
cyl mratio
<dbl> <dbl>
1 4 0.635
2 6 0.590
3 8 0.554
Using the pipe operator with non-dplyr functions
We can also pipe into functions from outside dplyr, as long as the dataset goes in as the first argument of that function. For example, we can pipe mtcars into head().
mtcars %>%
    head(n = 10)
If the first argument of a function is not the dataset, we need to use the dot, ., to represent the dataset
name in the function we are piping into. We can see this if we use the pipe operator with the t.test()
function, which doesn’t have data as the first argument.
Here we test for a difference in mean horsepower among transmission types based on the mtcars dataset.
The dataset is piped to the data argument with the ..
mtcars %>%
t.test(hp ~ am, data = .)
data: hp by am
t = 1.2662, df = 18.715, p-value = 0.221
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-21.87858 88.71259
sample estimates:
mean in group 0 mean in group 1
160.2632 126.8462
We generally wouldn’t use piping in such a simple case, though, as we would use the data argument in
t.test() directly. A more realistic example is if we wanted to filter the dataset before doing the test.
Let’s filter mtcars to cars weighing less than or equal to 4000 lbs and then test if mean horsepower is different
between transmission types.
mtcars %>%
filter(wt <= 4) %>%
t.test(hp ~ am, data = .)
data: hp by am
t = 0.76927, df = 19.747, p-value = 0.4508
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-35.68283 77.32385
sample estimates:
mean in group 0 mean in group 1
147.6667 126.8462
Counting the number of rows in a group
The n() function from dplyr counts the number of rows per group. It takes no arguments and can be used inside summarise() along with other summary functions, as below where we count the rows and calculate the mean disp for each cyl category.
mtcars %>%
group_by(cyl) %>%
summarise( n = n(),
mdisp = mean(disp) )
# A tibble: 3 x 3
cyl n mdisp
<dbl> <int> <dbl>
1 4 11 105.
2 6 7 183.
3 8 14 353.
Other useful functions that are related to n() are count() and tally() which can tally up number of rows
per group in fewer steps. Take a look at the help page for those to see how they work.
This function can be used directly inside other functions, such as filter(), for removing rows based on the group total count. I’ll keep only the rows of the dataset for cyl groups that have fewer than 10 observations. It turns out that this is true only for the 6-cylinder group.
To show best practice here I’ll ungroup() at the end of the pipe chain.
mtcars %>%
group_by(cyl) %>%
filter(n() < 10) %>%
ungroup()
# A tibble: 7 x 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
4 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
5 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
6 17.8 6 168. 123 3.92 3.44 18.9 1 0 4 4
7 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
The n() function can also be used when assigning index numbers within groups. I use this most often when
my rows within groups aren’t uniquely identified but I need them to be. This is especially useful if the group
sizes aren’t known or might vary.
In this example we’ll also select() just the first three columns so we can easily see the new index column
that we create. This column indexes from one to group size (n()) in each cylinder group.
mtcars %>%
group_by(cyl) %>%
select(1:3) %>%
mutate( index = 1:n() ) %>%
ungroup()
# A tibble: 32 x 4
mpg cyl disp index
<dbl> <dbl> <dbl> <int>
1 21 6 160 1
2 21 6 160 2
3 22.8 4 108 1
4 21.4 6 258 3
5 18.7 8 360 1
6 18.1 6 225 4
7 14.3 8 360 2
8 24.4 4 147. 2
9 22.8 4 141. 3
10 19.2 6 168. 5
# ... with 22 more rows
We might want to add this index in based on the order of some variable in the dataset, not on the order the
dataset is when we read it in. This is a case for arrange().
Let’s add the index based on the order of disp within each cyl category. We arrange() prior to creating
the index variable.
mtcars %>%
arrange(cyl, disp) %>%
group_by(cyl) %>%
select(1:3) %>%
mutate( index = 1:n() ) %>%
ungroup()
# A tibble: 32 x 4
mpg cyl disp index
<dbl> <dbl> <dbl> <int>
1 33.9 4 71.1 1
2 30.4 4 75.7 2
3 32.4 4 78.7 3
4 27.3 4 79 4
5 30.4 4 95.1 5
6 22.8 4 108 6
7 21.5 4 120. 7
8 26 4 120. 8
9 21.4 4 121 9
10 22.8 4 141. 10
# ... with 22 more rows
Practice data manipulation
So far we’ve covered a lot of material on data manipulation functions. Before going on to the next topic, I want to take some time to allow you to practice using some of the functions we’ve seen so far. I’ve set up two example problems below. Each example will take a different set of functions to solve.
The babynames dataset
We’ll be practicing using the babynames dataset. This can be found in package babynames. The current version of this package is 1.0.0. If you do not have this package or it is not up to date, please install it. You can do this with the RStudio Packages pane Install button, or run the code install.packages("babynames").
packageVersion("babynames")
[1] '1.0.0'
library(babynames)
The help page for babynames gives us some basic information on the dataset.
?babynames
The babynames dataset contains data from the United States Social Security Administration on the number
and proportion of babies given a name each year from 1880 through 2017. Rare names (recorded less than
5 times) are excluded from the dataset in R. The annual proportion of babies given a name was calculated
separately for male and female babies (sex).
The dataset has five variables, shown below.
glimpse(babynames)
Observations: 1,924,665
Variables: 5
$ year <dbl> 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880...
$ sex <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F...
$ name <chr> "Mary", "Anna", "Emma", "Elizabeth", "Minnie", "Margaret", "Ida", "Alice", "Bertha...
$ n <int> 7065, 2604, 2003, 1939, 1746, 1578, 1472, 1414, 1320, 1288, 1258, 1226, 1156, 1063...
$ prop <dbl> 0.07238359, 0.02667896, 0.02052149, 0.01986579, 0.01788843, 0.01616720, 0.01508119...
head(babynames)
# A tibble: 6 x 5
year sex name n prop
<dbl> <chr> <chr> <int> <dbl>
1 1880 F Mary 7065 0.0724
2 1880 F Anna 2604 0.0267
3 1880 F Emma 2003 0.0205
4 1880 F Elizabeth 1939 0.0199
5 1880 F Minnie 1746 0.0179
6 1880 F Margaret 1578 0.0162
Practice problem 1
Which name was given to the largest number of babies in the year you were born?
Practice problem 2
The second practice problem involves filtering, grouping, and then summarizing the number of rows per
group.
Calculate the total number of baby names for each level of the sex variable in the
year you were born and in 2017.
Hint: To use filter() with multiple values you’ll need %in% instead of ==. For example, if you wanted to
filter to the years 1980 and 2015 you’d use year %in% c(1980, 2015) for the condition in filter().
Here’s how I tackled this.
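One way to tackle this, using 1976 as the birth year (as in the later examples):
babynames %>%
    filter( year %in% c(1976, 2017) ) %>%
    group_by(year, sex) %>%
    summarise( n = n() )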
Part 2: Reshaping datasets
We are going to switch gears now and talk about how to reshape datasets.
In this section, we will learn to take the information from the columns of a dataset and put that information
on rows instead. This is an example of taking a wide dataset and making it long. We will also learn to
take information from the rows of a dataset and put that information into columns instead. In other words,
reshape a dataset from long to wide. None of this changes how much information we have, it just changes
how the information is stored.
We will be learning to reshape using the tidyr package.
The current language of the tidyr package involves pivoting. To pivot long means to take a wide format
dataset and transform it into a long dataset. To pivot wide means to take a long format dataset and make
it wide. We’ll see examples of these as we go along, which should help clear up any confusion with this new
terminology.
We’ll learn the basics of reshaping on what I call a toy dataset. A toy dataset is a set of fake data that we
make to practice functions on. Small toy datasets are handy when you are learning a new function or trying
to troubleshoot a data manipulation technique. We could use built-in datasets like mtcars, as well, but toy
datasets are conveniently very small.
The dataset that we will create, toy1, will have six rows and five columns.
The first column contains the levels of some treatment (trt).
The second contains the identifier of individuals the treatment was applied to (indiv). These identifiers are
repeated across treatments, so individual 1 in treatment a is different than individual 1 in treatment b. This
means the combination of treatment and individual is the unique identifier for each row.
The last three columns are some quantitative measurement taken at three different times (time1, time2,
and time3).
The shape of this toy dataset is one I commonly see for data from studies that take measurements through
time.
I’m not going to walk through this code, but below you can see how I create this dataset. If you are interested
in more information on how to get started simulating data in R, see my post here.
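A sketch of one way to build such a dataset (the seed and the rnorm() values here are placeholders, so the numbers will not match the printed results exactly):
set.seed(16)
toy1 = data.frame(indiv = rep(1:3, times = 2),
                  trt = factor( rep(c("a", "b"), each = 3) ),
                  time1 = rnorm(6),
                  time2 = rnorm(6),
                  time3 = rnorm(6))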
This dataset toy1 is in a wide format. It has 6 rows, and the quantitative values are stored in the 3 “time”
columns for a total of 18 values.
If we were going to analyze this dataset in R we would most likely need it to be in a long format. We
want to keep the two columns containing the identifying information (trt, indiv), have a single column
containing the information about the time of measurement (time1, time2, or time3), and a single column
containing the values of the quantitative measurement. To lengthen a dataset from wide to long we use the
pivot_longer() function.
Wide to long with pivot_longer()
The tidyr package was built to be used with pipes, and the dataset is the first argument for its functions.
In pivot_longer(), the first thing we do after defining the dataset we want to reshape is to list the columns
that contain values we want to be combined into a single column in cols. We can use the select_helpers
we learned earlier for this.
Once we pick the columns we are combining, we name the new “grouping” column that will contain the
names of the columns we are combining with names_to. I will name this new column time. Note that when
naming a column we need to use a string, meaning the name has to be in quotes.
Finally we need to name the new column of values using values_to. I’ll name this column measurement.
This is also done using a string.
We have the same amount of information in the newly long dataset below as we did in the wide dataset.
We have 18 values, now stored in a single column. We changed the shape of the dataset, not the underlying
data.
toy1 %>%
pivot_longer(cols = time1:time3,
names_to = "time",
values_to = "measurement")
# A tibble: 18 x 4
indiv trt time measurement
<int> <fct> <chr> <dbl>
1 1 a time1 1.17
2 1 a time2 -0.292
3 1 a time3 0.738
4 2 a time1 0.740
5 2 a time2 0.823
6 2 a time3 0.366
7 3 a time1 0.0754
8 3 a time2 -2.09
9 3 a time3 0.0477
10 1 b time1 -0.407
11 1 b time2 -1.83
12 1 b time3 0.992
13 2 b time1 0.528
14 2 b time2 0.102
15 2 b time3 -0.695
16 3 b time1 0.621
17 3 b time2 -0.727
18 3 b time3 -0.523
We’d better name this newly long-format object so we can use it in further examples. We’ll use this long
dataset to practice putting it back into wide format. This time I use starts_with() to choose the columns.
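A version of that naming step, using starts_with() to choose the time columns:
toy1long = toy1 %>%
    pivot_longer(cols = starts_with("time"),
                 names_to = "time",
                 values_to = "measurement")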
Long to wide with pivot_wider()
Now we can use the function pivot_wider() to widen the long dataset toy1long back to its original format.
You might want to do this if, for example, you were going to take a dataset from an analysis done in R to
graph in a program like SigmaPlot (which apparently often works best on wide datasets).
In the pivot_wider() function, we’ll use the pair of arguments names_from and values_from after defining
the dataset we want to reshape.
The names_from argument is where we list the column(s) that contains the values we will use as the new
column names. We are referring to an existing column, so this can be done with bare names (i.e., without
quotes around the variable names).
We list the column that contains the value(s) we will fill the new columns with using values_from.
toy1long %>%
pivot_wider(names_from = time,
values_from = measurement)
# A tibble: 6 x 5
indiv trt time1 time2 time3
<int> <fct> <dbl> <dbl> <dbl>
1 1 a 1.17 -0.292 0.738
2 2 a 0.740 0.823 0.366
3 3 a 0.0754 -2.09 0.0477
4 1 b -0.407 -1.83 0.992
5 2 b 0.528 0.102 -0.695
6 3 b 0.621 -0.727 -0.523
In some cases we’ll want to make a wide dataset with new column names based on multiple variables in the
long dataset. In that case we can pass multiple variable names to names_from.
By default, the new column names will have an underscore (_) in them separating the information from the
two variables. The new column names are based on the order the variables are listed in names_from.
Now we have a 3 row dataset with quantitative values stored in 6 columns: we still have our original 18
pieces of information.
toy1long %>%
pivot_wider(names_from = c(trt, time),
values_from = measurement)
# A tibble: 3 x 7
indiv a_time1 a_time2 a_time3 b_time1 b_time2 b_time3
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1.17 -0.292 0.738 -0.407 -1.83 0.992
2 2 0.740 0.823 0.366 0.528 0.102 -0.695
3 3 0.0754 -2.09 0.0477 0.621 -0.727 -0.523
We can change the symbol used in the new column names with names_sep. Here I also change the new column names by changing the order in which the variables are listed in names_from.
toy1long %>%
pivot_wider(names_from = c(time, trt),
values_from = measurement,
names_sep = ".")
# A tibble: 3 x 7
indiv time1.a time2.a time3.a time1.b time2.b time3.b
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1.17 -0.292 0.738 -0.407 -1.83 0.992
2 2 0.740 0.823 0.366 0.528 0.102 -0.695
3 3 0.0754 -2.09 0.0477 0.621 -0.727 -0.523
If the rows of the long dataset aren’t uniquely identified when converting into a wide format you will get a
warning message from pivot_wider().
For example, if we were trying to spread toy1long but we only had the trt variable and not the indiv
variable our rows wouldn’t be uniquely identified. It is only the combination of trt, indiv, and time that
uniquely identifies a row.
Let’s remove indiv from the dataset using select().
toy1long %>%
select(-indiv)
# A tibble: 18 x 3
trt time measurement
<fct> <chr> <dbl>
1 a time1 1.17
2 a time2 -0.292
3 a time3 0.738
4 a time1 0.740
5 a time2 0.823
6 a time3 0.366
7 a time1 0.0754
8 a time2 -2.09
9 a time3 0.0477
10 b time1 -0.407
11 b time2 -1.83
12 b time3 0.992
13 b time1 0.528
14 b time2 0.102
15 b time3 -0.695
16 b time1 0.621
17 b time2 -0.727
18 b time3 -0.523
There are now multiple observations of each time for each trt category; our rows are not uniquely identified.
Let’s see what happens when we use pivot_wider() on this dataset without changing the code.
In particular, take a look at the warning messages. These messages contain useful information about what
is in the output and why. The output dataset looks pretty different than what we’ve seen before because all
3 values for each trt and time were kept but placed into lists.
toy1long %>%
select(-indiv) %>%
pivot_wider(names_from = time,
values_from = measurement)
Warning: Values in `measurement` are not uniquely identified; output will contain list-cols.
* Use `values_fn = list(measurement = list)` to suppress this warning.
* Use `values_fn = list(measurement = length)` to identify where the duplicates arise
* Use `values_fn = list(measurement = summary_fun)` to summarise duplicates
# A tibble: 2 x 4
trt time1 time2 time3
<fct> <list> <list> <list>
1 a <dbl [3]> <dbl [3]> <dbl [3]>
2 b <dbl [3]> <dbl [3]> <dbl [3]>
If we really want to widen the dataset without indiv, we most likely want to summarize over the values for each trt and time. This can be done using the values_fn argument, as suggested in the warning message above.
toy1long %>%
select(-indiv) %>%
pivot_wider(names_from = time,
values_from = measurement,
values_fn = list(measurement = mean) )
# A tibble: 2 x 4
trt time1 time2 time3
<fct> <dbl> <dbl> <dbl>
1 a 0.661 -0.520 0.384
2 b 0.247 -0.817 -0.0754
Practice reshaping
Before we move on to Part 3 of the workshop I want you to take time to practice reshaping with pivot_wider() and pivot_longer().
We will once again be working with the babynames dataset.
Practice problem 3
The third practice problem is based on our work from practice problem 2, where we calculated the total number of baby names in the year we were born and in 2017 for each sex.
I didn’t name the final object, but I need to in order to use it in this problem. I’ll do that here, and print
the result so I remember what it looked like.
( numbaby_76_17 = babynames %>%
filter( year %in% c(1976, 2017) ) %>%
group_by(year, sex) %>%
summarise(n = n() ) %>%
ungroup() )
# A tibble: 4 x 3
year sex n
<dbl> <chr> <int>
1 1976 F 10900
2 1976 M 6491
3 2017 F 18309
4 2017 M 14160
Reshape the dataset to a wide format. Make a dataset with a separate column for
each sex containing the number of baby names in a given year.
Now reshape the same dataset to a different wide format. Make a dataset with a
separate column for each year containing the number of baby names in a given sex.
Take the dataset that has sex as separate columns and put this back in the original
format.
set.seed(16) # If I set the seed, we will all get the same random numbers
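If you want to follow along, you can build two small datasets with the structure we'll use: tojoin1 has plot-level counts and is missing site 3 treatment "c", and tojoin2 has plot elevations and is missing site 3 treatment "a". This is a minimal sketch, so the count and elev values below are illustrative random draws and won't match the printed output exactly.
# A sketch of the two joining datasets; the structure (which
# site-treatment combinations are present) matters more than the
# actual count and elev values
( tojoin1 = data.frame(site = rep(1:3, each = 3),
                       treat = rep( c("a", "b", "c"), times = 3),
                       count = rpois(9, 5) ) %>%
     filter( !(site == 3 & treat == "c") ) )
( tojoin2 = data.frame(site = rep(1:3, each = 3),
                       treat = rep( c("a", "b", "c"), times = 3),
                       elev = rnorm(9, mean = 1000, sd = 25) ) %>%
     filter( !(site == 3 & treat == "a") ) )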
The unique identifier of each measurement in each dataset is a combination of site and treat; those are
the variables that we will use to tell R which rows within the two datasets to combine into one.
Let’s start our joining practice by joining these two datasets together using inner_join().
See the help page, ?join, to see a description of each type of join available in dplyr. In the documentation,
you will see that every join involves two datasets, called x and y, to be joined. The x dataset is the first
dataset you give to the join function and the y dataset is the second.
An inner join matches on the unique identifiers and returns only rows that are shared in both datasets.
From the documentation, an inner_join() will
return all rows from x where there are matching values in y, and all columns from x and y
By default, inner_join() joins on all columns shared by the two datasets. When we use this default, we
will get a message telling us which variables were used for joining when we run the code.
We’ll name our new combined dataset joined, and print the result.
To make our code more explicit and easily understandable, we can also use the by argument to define which
variables we want to join on. This is what I usually do.
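A sketch of what both versions could look like, using the tojoin1 and tojoin2 objects sketched above:
# Default: inner_join() uses every column the two datasets share,
# and prints a message saying which variables it joined by
( joined = inner_join(tojoin1, tojoin2) )
# The same join, spelling out the joining variables with `by`
( joined = inner_join(tojoin1, tojoin2, by = c("site", "treat") ) )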
We see above that the joined dataset only has 7 rows. This is because there are only 7 site-treatment
combinations that are present in both datasets. If we want to retain more rows, we’ll need a different kind
of join.
A left join is used when we want to keep all rows in the first dataset regardless of whether they have a match in the second dataset.
From the documentation, left_join() will
return all rows from x, and all columns from x and y. Rows in x with no match in y will have
NA values in the new columns.
So in our scenario, we should get 8 rows back because we have 8 rows in the first dataset (tojoin1). We will still be missing a row for site 3 treatment “c”, since that combination is only in the second dataset (tojoin2) and a left join keeps only the rows of x.
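A sketch of the left join, with tojoin1 as the x dataset:
# Keep all 8 rows of tojoin1; the row with no match in tojoin2
# (site 3 treatment "a") gets NA for elev
left_join(tojoin1, tojoin2, by = c("site", "treat") )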
site treat count elev
1 1 a 7 1003.0419
2 1 b 4 1036.1179
3 1 c 6 977.3154
4 2 a 4 976.0533
5 2 b 9 984.7461
6 2 c 5 1017.7019
7 3 a 3 NA
8 3 b 8 1012.7026
There is also a right_join(), which we won’t practice today but works a lot like the left_join().
To keep all rows in both datasets regardless of a match, we can make a full join via full_join().
The full_join() function will
return all rows and all columns from both x and y. Where there are not matching values, returns
NA for the one missing.
This is how we can get rows for all nine site-treatment combinations.
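A sketch of the full join:
# Keep all 9 site-treatment combinations; count is NA for site 3
# treatment "c" and elev is NA for site 3 treatment "a"
full_join(tojoin1, tojoin2, by = c("site", "treat") )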
There is an additional sentence in the documentation when describing the joins that we haven’t discussed
yet.
If there are multiple matches between x and y, all combination of the matches are returned.
This is an important topic to cover, as sometimes we want this behavior but other times this behavior helps
us uncover a mistake we are making.
For example, if we wanted to join a variable that was only measured at the “site” level, this behavior is
desirable. Let’s make a dataset that has a variable measured at the site level.
# A site level variable, the amount of rainfall
( tojoin3 = data.frame(site = 1:3,
rainfall = rgamma(3, 10, 1) ) )
site rainfall
1 1 16.43203
2 2 9.30625
3 3 11.35799
This new dataset only has 3 rows. Every treatment plot within a site in the count dataset tojoin1 needs to be assigned the same value of the site-level rainfall variable. This means each row in the site-level dataset will be
matched to multiple rows in tojoin1 when joined, which is exactly what we want to happen. We end up
with an 8 row dataset.
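A sketch of this join; I use left_join() here to keep every row of tojoin1, although any of the mutating joins would return the same 8 rows since all three sites appear in both datasets.
# Join the site-level rainfall onto the plot-level counts; each
# rainfall value is repeated for every plot at that site
left_join(tojoin1, tojoin3, by = "site")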
This sort of behavior can cause unexpected results, though. If we join the original two joining datasets using
only site instead of the two variables that make up the unique identifier of each row, we will end up with
multiple matches per row. This leaves us with a dataset that is much longer than expected.
When this sort of thing happens unexpectedly, we likely need to step back and evaluate whether or not we
have unique identifiers. We may need to rethink what we are doing versus what we want the final dataset
to look like.
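For example, a full join on site alone gives the 22 row dataset below (any of the mutating joins returns the same 22 rows here); the as_tibble() at the end is my addition, only there so the long result prints compactly.
# Joining on site only: every row of tojoin1 is matched to every
# tojoin2 row at the same site, so treat shows up twice (treat.x, treat.y)
full_join(tojoin1, tojoin2, by = "site") %>%
     as_tibble()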
# A tibble: 22 x 5
site treat.x count treat.y elev
<int> <fct> <int> <fct> <dbl>
1 1 a 7 b 1036.
2 1 a 7 c 977.
3 1 a 7 a 1003.
4 1 b 4 b 1036.
5 1 b 4 c 977.
6 1 b 4 a 1003.
7 1 c 6 b 1036.
8 1 c 6 c 977.
9 1 c 6 a 1003.
10 2 a 4 b 985.
# ... with 12 more rows
Using anti_join() to find missing data
The very last function we’ll learn today is yet another kind of join, called the anti_join().
An anti_join() will
return all rows from x where there are not matching values in y, keeping just columns from x.
This is great for figuring out which rows are missing matches between two datasets. In an anti-join, we want
to only return the values in the x dataset that are not in the y dataset.
Both anti_join() and the related semi_join() act more like filters than joins.
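We won't practice semi_join() today, but for reference it keeps the rows of x that do have a match in y, without adding any columns from y. A quick sketch:
# The 7 rows of tojoin1 with a site-treatment match in tojoin2,
# keeping only the columns of tojoin1
semi_join(tojoin1, tojoin2, by = c("site", "treat") )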
Here’s what this looks like, pulling out the row in tojoin1 that is missing from tojoin2. We see we are
missing treatment “a” at site 3 from tojoin2.
tojoin1 %>%
     anti_join( tojoin2, by = c("site", "treat") )
If we want to find the row in tojoin2 that is missing from tojoin1, we switch the order we put the datasets in anti_join(). Now we see we are missing treatment “c” at site 3 from tojoin1.
We can pipe the dataset as the y dataset, as well, using the . placeholder we saw earlier.
tojoin1 %>%
anti_join( tojoin2, ., by = c("site", "treat") )
Joins are an important skill to learn for data manipulation. The main take-home message here is that joins
can be used as part of a longer chain of data manipulation steps via the pipe.
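As a sketch of that idea, here is one way to chain the example joins together with other dplyr steps: join the elevation and rainfall data onto the counts and then summarize by site. The summary variables are just for illustration.
tojoin1 %>%
     left_join(tojoin2, by = c("site", "treat") ) %>%
     left_join(tojoin3, by = "site") %>%
     group_by(site) %>%
     summarise(total_count = sum(count),
               mean_elev = mean(elev, na.rm = TRUE),
               rainfall = mean(rainfall) ) %>%
     ungroup()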
Two additional dplyr functions
There are a couple of other functions I use a lot for data checking, which I've put here at the end. We may not get to these during the workshop, so I have listed them here for reference.
The n_distinct() function is useful for counting up the number of unique values of a variable. I use it most when I'm learning about a dataset that I don't know well and want to understand the structure of individual variables.
I also use n_distinct() when I think I have mistakes in a variable, such as a value of a categorical variable
being misentered. For example, if we know our dataset should only have 3 values for cyl we can check to
make sure our variable doesn’t contain more than that with n_distinct().
mtcars %>%
summarise( ncyl = n_distinct(cyl) )
ncyl
1 3
Another example is checking how many unique values of one variable there are in each group. Here we'll calculate
how many unique values of mpg there are in each cylinder category with n_distinct() and compare that
to the number of rows we have in that category calculated with n().
mtcars %>%
group_by(cyl) %>%
summarise( nmpg = n_distinct(mpg),
n = n() )
# A tibble: 3 x 3
cyl nmpg n
<dbl> <int> <int>
1 4 9 11
2 6 6 7
3 8 12 14
There are fewer unique mpg values (only 27) than there are rows in the dataset.
The last of the dplyr functions we will see is the distinct() function. This is the function we can use if
we need to remove duplicate-valued rows from our dataset.
For example, we saw above that we had fewer unique values of mpg in each cyl group than we had rows in
the dataset. Let’s pull out only the distinct values of mpg per cyl group.
The resulting dataset has 27 rows instead of the 32 rows of the original dataset. These are the rows that
contain the first of each unique value of mpg within each cylinder category.
mtcars %>%
group_by(cyl) %>%
distinct(mpg) %>%
ungroup()
# A tibble: 27 x 2
mpg cyl
<dbl> <dbl>
1 21 6
2 22.8 4
3 21.4 6
4 18.7 8
5 18.1 6
6 14.3 8
7 24.4 4
8 19.2 6
9 17.8 6
10 16.4 8
# ... with 17 more rows
Above we only kept the grouping variables and the variable we used to determine uniqueness. If we want to
keep all the variables when using distinct() we need the .keep_all argument.
mtcars %>%
group_by(cyl) %>%
distinct(mpg, .keep_all = TRUE) %>%
ungroup()
# A tibble: 27 x 11
mpg cyl disp hp drat wt qsec vs am gear carb
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl>
1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
3 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
4 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
5 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
6 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
7 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
8 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
9 17.8 6 168. 123 3.92 3.44 18.9 1 0 4 4
10 16.4 8 276. 180 3.07 4.07 17.4 0 0 3 3
# ... with 17 more rows
Answers problem 1
Which name was given to the largest number of babies in the year you were born?
I was born in 1976, so I first filter the dataset to that year. Since I wanted to find the name given to the
largest number of babies I then sorted by n in descending order.
The most babies were named Michael in 1976.
babynames %>%
filter(year == 1976) %>%
arrange(-n)
# A tibble: 17,391 x 5
year sex name n prop
<dbl> <chr> <chr> <int> <dbl>
1 1976 M Michael 66964 0.0410
2 1976 F Jennifer 59474 0.0378
3 1976 M Jason 52681 0.0323
4 1976 M Christopher 45213 0.0277
5 1976 M David 39299 0.0241
6 1976 M James 38306 0.0235
7 1976 M John 34008 0.0208
8 1976 M Robert 33809 0.0207
9 1976 F Amy 31341 0.0199
10 1976 M Brian 30535 0.0187
# ... with 17,381 more rows
I need to filter the dataset to 2017 and babies named Michael. Notice I didn’t specify sex, and there were
both male and female babies named Michael in 2017.
babynames %>%
filter(year == 2017, name == "Michael")
# A tibble: 2 x 5
year sex name n prop
<dbl> <chr> <chr> <int> <dbl>
1 2017 F Michael 33 0.0000176
2 2017 M Michael 12579 0.00641
Go to practice problem 2
Answer problem 2
Calculate the total number of baby names for each level of the sex variable in the
year you were born and in 2017.
We didn’t see how to filter to multiple values, so we’ll need to make use of the hint to do so.
I’ll first filter the dataset to only the years 1976 and 2017. Then I’ll group it by both year and sex so I can
add up the total number of rows in each year for each sex with summarise() and n(). This works because
each row in the babynames dataset is a unique name.
I end with ungroup() to make sure the final result isn’t grouped anymore.
There are more unique baby names in 2017 than in 1976, and in both years there were more unique names for female babies than for male babies.
babynames %>%
filter( year %in% c(1976, 2017) ) %>%
group_by(year, sex) %>%
summarise(n = n() ) %>%
ungroup()
# A tibble: 4 x 3
year sex n
<dbl> <chr> <int>
1 1976 F 10900
2 1976 M 6491
3 2017 F 18309
4 2017 M 14160
Answers problem 3
These questions are based on the final dataset from practice problem 2. My first step was to name this
object so I can use it to answer the question in problem 3.
Reshape the dataset from practice problem 2 to a wide format. Make a dataset with
a separate column for each sex containing the number of baby names in a given year.
Since we’re going from long to wide we’ll need pivot_wider(). The question specifically asks for separate
columns by sex, which tells me that this variable should be listed in names_from. The n variable is what I
need to fill the columns with so I use it as the values_from variable.
numbaby_76_17 %>%
pivot_wider(names_from = sex,
values_from = n)
# A tibble: 2 x 3
year F M
<dbl> <int> <int>
1 1976 10900 6491
2 2017 18309 14160
Now reshape the same dataset to a different wide format. Make a dataset with a
separate column for each year containing the number of baby names in a given sex.
This is very similar to the first question, except this time year is the names_from variable. Notice that the result has backticks around the new column names, since column names that start with a number are not syntactically valid in R.
numbaby_76_17 %>%
pivot_wider(names_from = year,
values_from = n)
# A tibble: 2 x 3
sex `1976` `2017`
<chr> <int> <int>
1 F 10900 18309
2 M 6491 14160
Take the dataset that has sex as separate columns and put this back in the original
format.
Since we are now going from “wide” to “long”, this involves using pivot_longer(). The two columns that
contain information I want to gather are F and M. I’ll call the new categorical column "sex" and the new
continuous column "num_name".
numbaby_76_17 %>%
pivot_wider(names_from = sex,
values_from = n) %>%
pivot_longer(cols = F:M,
names_to = "sex",
values_to = "num_name")
# A tibble: 4 x 3
year sex num_name
<dbl> <chr> <int>
1 1976 F 10900
2 1976 M 6491
3 2017 F 18309
4 2017 M 14160