How To Work With List Columns
August 2018
Garrett Grolemund
CC-BY-4.0
CC by RStudio
[Title-slide collage: code fragments library(babynames), babynames, and filter(babynames, name == "Mary"), plus a two-column table of Species and beta (setosa 2.35, versi 1.89, virginica 0.69)]
1. Data Structures in R
2. Data frames, tibbles, and list columns
4. Composing functions
5. map() functions
6. Case Study
7. Tao of Tidy
[Diagram: concept map of the topics covered: Data Structures, types, atomic vectors, lists, classes, factors, datetimes, data frames, Tibbles, List Columns, Manipulating Data Structures (dplyr), filter(), arrange(), select(), mutate(), summarise(), group_by(), Composing functions, %>%, map() functions (purrr), What is tidy?]
rstd.io/list-columns
code.Rmd slides.pdf
rstd.io/purrr-cheatsheet
rstd.io/dplyr-cheatsheet
Data Structures in R
1
[1] 1
c(1, 2, 3.14)
[1] 1.00 2.00 3.14
is.vector(1)
[1] TRUE
is.vector(c(1, 2, 3.14))
[1] TRUE
typeof(c(1L, 2L, 3L))
[1] "integer"
typeof(c(1, 2, 3.14))
[1] "double"
typeof(c("a", "b", "c"))
[1] "character"
typeof(c(TRUE, FALSE))
[1] "logical"
x <- c(1L, 2L, 3L)
x
[1] 1 2 3
class(x) <- "Date"
x
[1] "1970-01-02" "1970-01-03" "1970-01-04"
x <- c(1L, 2L, 3L)
levels(x) <- c("Blue", "Brown", "Green")
class(x) <- "factor"
x
[1] Blue Brown Green
Levels: Blue Brown Green
dim(x) <- c(3, 1)
x
[,1]
[1,] Blue
[2,] Brown
[3,] Green
Levels: Blue Brown Green
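The slides above build a Date and a factor by attaching attributes to a plain integer vector. A minimal sketch of how one might inspect that structure with base R follows; the names `eyes` and `day` are illustrative and do not appear in the slides.

```r
# Build a factor by hand, the same way the slides do.
eyes <- c(1L, 2L, 3L)
levels(eyes) <- c("Blue", "Brown", "Green")
class(eyes) <- "factor"

typeof(eyes)       # the underlying storage type is unchanged
attributes(eyes)   # levels and class live in the attributes
unclass(eyes)      # drop the class to see the integer codes again

# A Date is stored as a count of days since 1970-01-01.
day <- as.Date("1970-01-02")
unclass(day)
```

The point of the sketch is that the class changes how the vector prints and behaves, while the data underneath stays an ordinary atomic vector.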
(y <- list(a = c(1, 2, 3.14),
           b = c("a", "b", "c"),
           c = c(TRUE, FALSE, FALSE)))
$a
[1] 1.00 2.00 3.14

$b
[1] "a" "b" "c"

$c
[1]  TRUE FALSE FALSE
typeof(y)
[1] "list"
is.vector(y)
[1] TRUE
class(y) <- "data.frame"
rownames(y) <- c("1", "2", "3")
y
a b c
1 1.00 a TRUE
2 2.00 b FALSE
3 3.14 c FALSE
Should this work? Yes!
y$d <- list(p = 1:3, q = TRUE, r = 0L)
y
     a b     c       d
1 1.00 a  TRUE 1, 2, 3
2 2.00 b FALSE    TRUE
3 3.14 c FALSE       0
y$d
$p
[1] 1 2 3
$q
[1] TRUE
$r
[1] 0
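Because the new column is a list, each "cell" can hold a vector of any type and length. A short sketch of how one might reach into such a column with base R is below; the data frame name `df` is made up for this example so that it does not overwrite the slides' `y`.

```r
# A small data frame with one ordinary column and one list column.
df <- data.frame(a = c(1, 2, 3.14))
df$d <- list(p = 1:3, q = TRUE, r = 0L)

df$d[[1]]       # double brackets return the contents of one cell
df$d[1]         # single brackets return a list of length one
lengths(df$d)   # how many values each cell holds
```

The `[[` versus `[` distinction is the same one that applies to any list in R.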
Data frames, Tibbles, and List Columns
data.frame(a = c(1, 2, 3.14),
b = c("a", "b", "c"),
c = c(TRUE, FALSE, FALSE))
a b c
1 1.00 a TRUE
2 2.00 b FALSE
3 3.14 c FALSE
data.frame(list(a = c(1, 2, 3.14),
b = c("a", "b", "c"),
c = c(TRUE, FALSE, FALSE)))
a b c
1 1.00 a TRUE
2 2.00 b FALSE
3 3.14 c FALSE
data.frame(a = c(1, 2, 3.14),
b = c("a", "b", "c"),
c = c(TRUE, FALSE, FALSE),
d = list(p = 1:3, q = TRUE, r = 0L))
z <- data.frame(a = c(1, 2, 3.14),
                b = c("a", "b", "c"),
                c = c(TRUE, FALSE, FALSE))
z
     a b     c
1 1.00 a  TRUE
2 2.00 b FALSE
3 3.14 c FALSE
z$d <- list(p = 1:30, q = TRUE, r = 0L)
z
library(tibble)
class(z) <- c("tbl_df", "tbl", "data.frame")
z
# A tibble: 3 x 4
a b c d
<dbl> <fct> <lgl> <list>
1 1 a TRUE <int [30]>
2 2 b FALSE <lgl [1]>
3 3.14 c FALSE <int [1]>
library(tibble)
as_tibble(z)
# A tibble: 3 x 4
a b c d
<dbl> <fct> <lgl> <list>
1 1 a TRUE <int [30]>
2 2 b FALSE <lgl [1]>
3 3.14 c FALSE <int [1]>
data.frame(a = c(1, 2, 3.14),
b = c("a", "b", "c"),
c = c(TRUE, FALSE, FALSE),
d = list(p = 1:3, q = TRUE, r = 0L))
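The call above hands a list to data.frame() as a column. A hedged sketch of three ways this can play out is below: to my understanding base data.frame() splices a list argument into separate columns, I() protects the list so it is kept as a single column, and tibble() accepts a list column directly. The object names `spliced`, `df_asis`, and `tb` are illustrative only.

```r
library(tibble)

# data.frame() treats the list as several columns, not one list column.
spliced <- data.frame(a = c(1, 2, 3.14),
                      d = list(p = 1:3, q = TRUE, r = 0L))
names(spliced)

# Wrapping the list in I() asks data.frame() to keep it as is.
df_asis <- data.frame(a = c(1, 2, 3.14),
                      d = I(list(1:3, TRUE, 0L)))

# tibble() stores a list as a list column without extra steps.
tb <- tibble(a = c(1, 2, 3.14),
             b = c("a", "b", "c"),
             c = c(TRUE, FALSE, FALSE),
             d = list(p = 1:3, q = TRUE, r = 0L))
tb
```

If list columns are the goal, building the table with tibble() is probably the least surprising route.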
tibble display
[Console output: a large table printed as a plain data frame; rows 156 to 166 are shown before the printout is cut off with `[ reached getOption("max.print") -- omitted 68 rows ]`]
dplyr single table verbs
tidyr nest()

Species    S.L  S.W  P.L  P.W
setosa     5.1  3.5  1.4  0.2
setosa     4.9  3.0  1.4  0.2
setosa     4.7  3.2  1.3  0.2
setosa     4.6  3.1  1.5  0.2
setosa     5.0  3.6  1.4  0.2
versi      7.0  3.2  4.7  1.4
versi      6.4  3.2  4.5  1.5
versi      6.9  3.1  4.9  1.5
versi      5.5  2.3  4.0  1.3
versi      6.5  2.8  4.6  1.5
virginica  6.3  3.3  6.0  2.5
virginica  5.8  2.7  5.1  1.9
virginica  7.1  3.0  5.9  2.1
virginica  6.3  2.9  5.6  1.8
virginica  6.5  3.0  5.8  2.2

nested data frame (n_iris)

Species     data
setosa      <tibble [50 x 4]>
versicolor  <tibble [50 x 4]>
virginica   <tibble [50 x 4]>

"cell" contents

n_iris$data[[1]]
Sepal.L  Sepal.W  Petal.L  Petal.W
    5.1      3.5      1.4      0.2
    4.9      3.0      1.4      0.2
    4.7      3.2      1.3      0.2
    4.6      3.1      1.5      0.2
    5.0      3.6      1.4      0.2

n_iris$data[[2]]
Sepal.L  Sepal.W  Petal.L  Petal.W
    7.0      3.2      4.7      1.4
    6.4      3.2      4.5      1.5
    6.9      3.1      4.9      1.5
    5.5      2.3      4.0      1.3
    6.5      2.8      4.6      1.5

n_iris$data[[3]]
Sepal.L  Sepal.W  Petal.L  Petal.W
    6.3      3.3      6.0      2.5
    5.8      2.7      5.1      1.9
    7.1      3.0      5.9      2.1
    6.3      2.9      5.6      1.8
    6.5      3.0      5.8      2.2
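The nest() diagram can be reproduced with a few lines of code. This is a sketch that assumes `n_iris` is built from the built-in `iris` data by grouping on `Species`; the slides show the result but not the call that created it.

```r
library(dplyr)
library(tidyr)

# Collapse each species into one row; the other columns move into
# a list column called data, one tibble per species.
n_iris <- iris %>%
  group_by(Species) %>%
  nest()

n_iris              # three rows: Species plus a list column
n_iris$data[[1]]    # the tibble stored in the first "cell"
```

Each element of `n_iris$data` is a complete tibble, which is why `[[ ]]` is used to pull one out.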
tidyr unnest()

nested data frame

Species     data
setosa      <tibble [50 x 4]>
versicolor  <tibble [50 x 4]>
virginica   <tibble [50 x 4]>

Species    S.L  S.W  P.L  P.W
setosa     5.1  3.5  1.4  0.2
setosa     4.9  3.0  1.4  0.2
setosa     4.7  3.2  1.3  0.2
setosa     4.6  3.1  1.5  0.2
setosa     5.0  3.6  1.4  0.2
versi      7.0  3.2  4.7  1.4
versi      6.4  3.2  4.5  1.5
versi      6.9  3.1  4.9  1.5
versi      5.5  2.3  4.0  1.3
versi      6.5  2.8  4.6  1.5
virginica  6.3  3.3  6.0  2.5
virginica  5.8  2.7  5.1  1.9
virginica  7.1  3.0  5.9  2.1
virginica  6.3  2.9  5.6  1.8
virginica  6.5  3.0  5.8  2.2
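unnest() reverses the nesting step. The sketch below rebuilds a nested table from `iris` (an assumption, since the slides do not show how the nested data frame was made) and then flattens it again; `flat` is an illustrative name.

```r
library(dplyr)
library(tidyr)

# Nest, then unnest: the result has one row per flower again.
nested <- iris %>%
  group_by(Species) %>%
  nest()

flat <- nested %>%
  unnest(data)

dim(flat)
```

The round trip should return the original number of rows and columns, although the column order and the grouping may differ from `iris`.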
Composing functions
%>%
pipes
x %>% f(y) becomes f(x, y)
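A small sketch of the rewriting rule, using the magrittr pipe that dplyr re-exports. The vector `v` is an illustrative value, not something from the slides.

```r
library(magrittr)

v <- c(1, 4, 9)

# The left-hand side becomes the first argument of the call on the right.
identical(v %>% sqrt(), sqrt(v))
identical(v %>% round(1), round(v, 1))

# Pipes chain left to right, so this reads as "take v, sqrt it, sum it".
total <- v %>% sqrt() %>% sum()
total
```

Written without pipes, the last line would be `sum(sqrt(v))`, which has to be read from the inside out.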
y
a b c d
1.00 a TRUE <int [3]>
2.00 b FALSE <lgl [1]>
3.14 c FALSE <int [1]>
y %>% mutate(asq = sqrt(a))
   a b c     d         asq
1.00 a TRUE  <int [3]> 1.00
2.00 b FALSE <lgl [1]> 1.41
3.14 c FALSE <int [1]> 1.77
[Diagram: mutate() passes the whole column a to sqrt() as one vector, sqrt(c(1.00, 2.00, 3.14)), and stores the result, 1.00 1.41 1.77, as the new column asq]
y %>% mutate(dsq = sqrt(d))
Error!
[Diagram: sqrt() receives the whole list column d (<int [3]>, <lgl [1]>, <int [1]>) as its argument and cannot handle a list]

y %>% mutate(dsq = map(d, sqrt))
   a b c     d         dsq
1.00 a TRUE  <int [3]> <dbl [3]>
2.00 b FALSE <lgl [1]> <dbl [1]>
3.14 c FALSE <int [1]> <dbl [1]>
[Diagram: map(d, sqrt) applies sqrt() to each element of d in turn, sqrt(c(1, 2, 3)), sqrt(TRUE), and sqrt(0), and returns a list of the results: <dbl [3]>, <dbl [1]>, <dbl [1]>]
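The difference between calling sqrt() on a list and mapping sqrt() over it can be checked outside of mutate(). This is a minimal sketch with purrr; the standalone list `d` mirrors the list column from the slides, and `direct` and `dsq` are illustrative names.

```r
library(purrr)

d <- list(p = 1:3, q = TRUE, r = 0L)

# Calling sqrt() on the list itself fails; capture the error as a flag.
direct <- tryCatch(sqrt(d), error = function(e) "error")

# map() calls sqrt() once per element and returns a list of results.
dsq <- map(d, sqrt)
str(dsq)

# When every result has length one, a typed variant returns a vector.
map_int(d, length)
```

Inside mutate(), `map(d, sqrt)` does the same thing to the column `d`, which is why the new column `dsq` is itself a list column.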
What is the slope?
What is the R-squared? (fit)
joe_mod <- lm(prop ~ year, data = joe)
joe_mod
Call:
lm(formula = prop ~ year, data = joe)
Coefficients:
(Intercept) year
1.178e-01 -5.857e-05
coef(joe_mod)
  (Intercept)          year
 1.178179e-01 -5.857169e-05
pluck(coef(joe_mod), "year")
[1] -5.857169e-05
library(broom)
glance(joe_mod)
  r.squared adj.r.squared        sigma statistic
1 0.8584798     0.8574236 0.0009405581  812.8611 9.
pluck(glance(joe_mod), "r.squared")
[1] 0.8584798
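The `joe` data is not defined in this section, so the sketch below repeats the same extraction steps on the built-in `mtcars` data instead. The model formula and the names `car_mod`, `slope`, and `r2` are stand-ins chosen for illustration.

```r
library(purrr)
library(broom)

# Fit one linear model, then pull out a single number from each summary.
car_mod <- lm(mpg ~ wt, data = mtcars)

slope <- pluck(coef(car_mod), "wt")           # slope for wt
r2    <- pluck(glance(car_mod), "r.squared")  # fit of the model

slope
r2
```

Both helpers return one number per model, which is what makes them easy to reuse with map_dbl() later on.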
babynames
# A tibble: 126,888 x 5
    year sex   name          n   prop
   <dbl> <chr> <chr>     <int>  <dbl>
 1  1880 F     Mary       7065 0.0724
 2  1880 F     Anna       2604 0.0267
 3  1880 F     Emma       2003 0.0205
 4  1880 F     Elizabeth  1939 0.0199
 5  1880 F     Minnie     1746 0.0179
 6  1880 F     Margaret   1578 0.0162
 7  1880 F     Ida        1472 0.0151
 8  1880 F     Alice      1414 0.0145
 9  1880 F     Bertha     1320 0.0135
10  1880 F     Sarah      1288 0.0132
# ... with 126,878 more rows
babynames %>%
  group_by(name, sex)
# A tibble: 126,888 x 5
# Groups:   name, sex [933]
    year sex   name          n   prop
   <dbl> <chr> <chr>     <int>  <dbl>
 1  1880 F     Mary       7065 0.0724
 2  1880 F     Anna       2604 0.0267
 3  1880 F     Emma       2003 0.0205
 4  1880 F     Elizabeth  1939 0.0199
 5  1880 F     Minnie     1746 0.0179
 6  1880 F     Margaret   1578 0.0162
 7  1880 F     Ida        1472 0.0151
 8  1880 F     Alice      1414 0.0145
 9  1880 F     Bertha     1320 0.0135
10  1880 F     Sarah      1288 0.0132
# ... with 126,878 more rows
babynames %>%
  group_by(name, sex) %>%
  nest()
# A tibble: 933 x 3
# ... with 923 more rows
Sanity check: What is in one of these cells?
babynames %>%
  group_by(name, sex) %>%
  nest() %>%
  pluck("data") %>%
  pluck(1)
# A tibble: 136 x 3
    year     n   prop
   <dbl> <int>  <dbl>
 1  1880  7065 0.0724
 2  1881  6919 0.0700
 3  1882  8148 0.0704
 4  1883  8012 0.0667
 5  1884  9217 0.0670
 6  1885  9128 0.0643
 7  1886  9889 0.0643
 8  1887  9888 0.0636
 9  1888 11754 0.0620
10  1889 11648 0.0616
# ... with 126 more rows
babynames %>%
  group_by(name, sex) %>%
  nest() %>%
  mutate(
    model = map(data,
                ~lm(prop ~ year,
                    data = .x))
  )
# A tibble: 933 x 4
   name      sex   data               model
   <chr>     <chr> <list>             <list>
 1 Mary      F     <tibble [136 × 3]> <S3: lm>
 2 Anna      F     <tibble [136 × 3]> <S3: lm>
 3 Emma      F     <tibble [136 × 3]> <S3: lm>
 4 Elizabeth F     <tibble [136 × 3]> <S3: lm>
 5 Minnie    F     <tibble [136 × 3]> <S3: lm>
 6 Margaret  F     <tibble [136 × 3]> <S3: lm>
 7 Ida       F     <tibble [136 × 3]> <S3: lm>
 8 Alice     F     <tibble [136 × 3]> <S3: lm>
 9 Bertha    F     <tibble [136 × 3]> <S3: lm>
10 Sarah     F     <tibble [136 × 3]> <S3: lm>
# ... with 923 more rows
babymods <- babynames %>%
  group_by(name, sex) %>%
  nest() %>%
  mutate(
    model = map(data,
                ~lm(prop ~ year,
                    data = .x)),
    slope = map_dbl(model,
                    ~pluck(coef(.x),
                           "year")),
    r_squared = map_dbl(model,
                        ~pluck(glance(.x), "r.squared")))
babymods
# A tibble: 933 x 6
   name      sex   data               model          slope r_squared
   <chr>     <chr> <list>             <list>         <dbl>     <dbl>
 1 Mary      F     <tibble [136 × 3]> <S3: lm> -0.000577     0.914
 2 Anna      F     <tibble [136 × 3]> <S3: lm> -0.000179     0.708
 3 Emma      F     <tibble [136 × 3]> <S3: lm> -0.0000657    0.230
 4 Elizabeth F     <tibble [136 × 3]> <S3: lm> -0.0000725    0.704
 5 Minnie    F     <tibble [136 × 3]> <S3: lm> -0.0000966    0.644
 6 Margaret  F     <tibble [136 × 3]> <S3: lm> -0.000173     0.803
 7 Ida       F     <tibble [136 × 3]> <S3: lm> -0.0000862    0.719
 8 Alice     F     <tibble [136 × 3]> <S3: lm> -0.000110     0.901
 9 Bertha    F     <tibble [136 × 3]> <S3: lm> -0.0000948    0.756
10 Sarah     F     <tibble [136 × 3]> <S3: lm>  0.00000845   0.00705
# ... with 923 more rows
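The same nest, model, and summarise pattern can be tried on a small built-in data set. The sketch below swaps in `mtcars` grouped by `cyl`, and uses base `summary()` in place of broom's glance() to keep the dependencies short; `car_mods` and the formula are illustrative choices, not part of the case study.

```r
library(dplyr)
library(tidyr)
library(purrr)

car_mods <- mtcars %>%
  group_by(cyl) %>%
  nest() %>%
  mutate(
    # one linear model per nested tibble
    model = map(data, ~ lm(mpg ~ wt, data = .x)),
    # one number per model, so map_dbl() returns a plain double column
    slope = map_dbl(model, ~ pluck(coef(.x), "wt")),
    r_squared = map_dbl(model, ~ summary(.x)$r.squared)
  )

car_mods
```

Keeping the data, the fitted model, and the summaries in the same row means they stay aligned through later calls to arrange(), filter(), or unnest().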
Which names increased the most?
babymods %>%
  arrange(desc(slope)) %>%
  head(5) %>%
  unnest(data) %>%
  ggplot(mapping = aes(x = year, y = prop)) +
    geom_line(mapping = aes(color = name))
Which names were the least linear?
babymods %>%
  arrange(r_squared) %>%
  head(5) %>%
  unnest(data) %>%
  ggplot(mapping = aes(x = year, y = prop)) +
    geom_line(mapping = aes(color = name))
The tao of tidy
"Data are not just numbers,
they are numbers with a context."
Two types of context
prop = n x 100?
Tidy Data
Each variable is in its own column
&
Each observation, or case, is in its own row
The tidyverse creates a system for working with tidy data
List Columns