STATS 20: Chapter 7 - Data Frames and Lists

Thomas Maierhofer

Fall 2024

Learning Objectives

After studying this chapter, you should be able to:


▶ Install and load packages in R.
▶ Access and interpret the R Help Documentation for built-in objects and functions.
▶ Load datasets from packages.
▶ Create data frames and lists.
▶ Differentiate between matrices and data frames.
▶ Extract and assign values to data frames and lists.
▶ Understand the difference between the mode and the class of an object.
▶ Summarize an R object with str() and summary().
▶ Understand how and when to use the apply family of functions: apply(),
lapply(), sapply(), vapply(), tapply().

Using R Packages

Introduction to R Packages

▶ An R package is a collection of functions, data, and documentation.


▶ Base R packages are included in the initial R download and loaded automatically
in each session:
▶ Examples: base, stats, graphics, datasets.
▶ R packages are stored on your computer in a library, which is a directory
containing installed packages.

Loading a Package with library()
To use an installed package, load it with the library() function (quotation marks around
the name are optional). Loading a package allows access to its functions and data in
your current R session.

library(MASS)
library("MASS") # same thing

The library() function will throw an error if you try to load a package that has not
been installed on your computer.

library(whoops)

## Error in library(whoops): there is no package called ’whoops’
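
A minimal sketch of how one might check for a package before calling library(), so that the error above is avoided. The helper name is_installed is hypothetical (it is not part of R or of these slides); requireNamespace() is a base R function.

```r
# Hypothetical helper: TRUE if `pkg` is installed.
# requireNamespace() loads the namespace but does not attach the package.
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

is_installed("stats")  # stats ships with every R installation

if (!is_installed("whoops")) {
  message("Package 'whoops' is not installed; library(whoops) would fail.")
}
```

This only tests availability; the package still has to be loaded with library() before its functions can be used without the :: prefix.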

Viewing Loaded Packages with search()
The search() function outputs R’s current search path, showing which packages
and environments are loaded.

search()

##  [1] ".GlobalEnv"        "package:MASS"      "package:stats"
##  [4] "package:graphics"  "package:grDevices" "package:utils"
##  [7] "package:datasets"  "package:methods"   "Autoloads"
## [10] "package:base"

The search path determines the order in which R looks for objects and functions. For
example:
▶ The Global Environment (“.GlobalEnv”) is first in the path, so R checks for
objects there before other packages.
▶ Assigning to pi in .GlobalEnv will mask the (built-in) pi from the base package.
Installing R Packages from CRAN

▶ R packages expand the capabilities of R with additional functions and datasets.


▶ Many packages are created by R users and researchers, but they are not included
in the initial R download.
▶ CRAN (Comprehensive R Archive Network) is the largest repository of R
packages available online.
▶ CRAN does only minimal quality control, so while it is safe to download
packages from there and install them on your computer, they might not work as
advertised.
▶ Packages from CRAN can be installed using the install.packages() function.

To install a package, call the install.packages() function with the package’s name:

install.packages("boot") # to install the boot package

Installing R Packages from GitHub
▶ Many R packages are hosted on GitHub for open-source collaboration, including
experimental or developmental packages not yet on CRAN.
▶ To install packages from GitHub, use the devtools package, which provides tools
for downloading, installing, and managing GitHub packages.
Steps to Install a Package from GitHub
1. First, install devtools from CRAN if you haven’t already:

install.packages("devtools") # handy tools for R developers

2. Use the install_github() function from devtools, inputting the username and
package name in the format “username/repository”.

library("devtools")
install_github("tidyverse/ggplot2")
Installing Packages using RStudio

Loading and Accessing Packages in R

▶ One-Time Installation: Packages only need to be installed once per computer.


▶ Loading Packages: After installation, use library() to load the package and
access its functions and data.
▶ Caution: Loaded packages do not carry over between sessions. Each time you quit
and reopen R, the packages you need must be loaded again with library().

Accessing Functions or Data Directly
Use :: to access a function from a package without loading it with library() using
the syntax package_name::function_name(), for example

ggplot2::qplot() # use the qplot function from the ggplot2 package

This makes sense when:


▶ you only need one function or data set from a package.
▶ you need to use two functions from different packages with the same name.
▶ you still want to use a function that was masked by a package that was loaded
later
pi <- 3 # overwrite pi in the global environment
pi # that's the one that R will find

## [1] 3

base::pi # the original in the base package is still there

## [1] 3.141593
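
The same idea applies to functions. The sketch below is an illustration only: it deliberately defines a throwaway sd in the global environment to show that the original stays reachable through stats::sd().

```r
sd <- function(x) "my own sd"  # user-defined sd in the global environment

sd(c(1, 2, 3))         # R finds the global version first
stats::sd(c(1, 2, 3))  # the version in the stats package is still reachable

rm(sd)                 # removing the global copy ends the masking
sd(c(1, 2, 3))         # now resolves to stats::sd again
```

Removing the masking object with rm() is usually cleaner than reaching around it with :: every time.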


Getting Help

Getting Help on Specific Functions

Use ? followed by the function name or use help() to view the documentation for
that function.

?mean # Displays documentation for mean()


help(mean) # Equivalent to ?mean

R documentation can provide insights on syntax and functionality, though it may not
always be user-friendly.

RStudio Help Tab

Searching for Functions by Topic

▶ If you’re unsure of a function’s name, use ?? followed by a search term, or
help.search(), to perform a fuzzy search:
▶ ?? searches across all help files of installed packages and returns matches based
on alias, concept, or title.

?regression # searches function (or other object) named regression within l


??regression # Searches for keyword "regression" in all help files
help.search("regression") # Equivalent to ??regression

Help with a Package in General

Package Documentation
To receive help on a specific package (that is already installed), use the help
argument in the library() function, like in the example below:

library(help = "MASS")

Every function, data set, or other object included in an R package has a help page.

Vignettes
Most packages have a so-called vignette (French for small illustration), which usually
showcases the functionality of a package using small examples that you can follow
along as you write your own code. These tend to be really useful, so look for them on
CRAN / GitHub.

Getting Help in Practice

1. The documentation provided in the R help pages that come with a package
varies in quality, but should be your first stop for help.
2. Check whether there is a vignette, or a companion book or website. Chances
are, there is one, and chances are, they go through a problem that is similar
enough to yours that you can work your way along.
3. Google your problem; someone on Stack Overflow or Stack Exchange might have
already solved it.
4. Try your luck with ChatGPT or GitHub Copilot, but chances are that if the problem is not
solved on the internet already, they don’t know how to do it either.

Back to R Packages

The data() Function
▶ The data() function loads datasets from packages in the search path and saves
a copy to the workspace.
▶ Many datasets are available in the datasets package, which is part of base R.
▶ Since the datasets package contains the trees data set it can be loaded directly
using data():
data(trees) # Load the trees object
ls() # The trees object has been added to the workspace

## [1] "pi" "trees"

trees[1:2,] # look at first two observations

##   Girth Height Volume
## 1   8.3     70   10.3
## 2   8.6     65   10.3

Question: How can we find out what type of trees were measured for this dataset?
Listing Datasets with data()
▶ The data() function can also list available datasets in a specific package.
▶ Use the package argument to specify the package you want to explore.
Example: Listing Datasets in MASS

data(package = "MASS") # Lists available datasets in the MASS package

The MASS package contains a dataset called geyser. To use it, load the package (if
not already loaded) and then load the dataset.
library(MASS) # Load the MASS package
data(geyser) # Load the geyser dataset
geyser[1:2,] # look at first two observations

## waiting duration
## 1 80 4.016667
## 2 71 2.150000

Question: Which geyser was measured for this dataset? When was this data collected?
The head() Function
▶ For large datasets, printing the entire object is often impractical.
▶ The head() function outputs the first few values of the input object.
▶ For vectors, it returns the first few elements.
▶ For data frames and matrices, it outputs the first few rows.

head(trees) # Shows the first few rows of the `trees` dataset

##   Girth Height Volume
## 1   8.3     70   10.3
## 2   8.6     65   10.3
## 3   8.8     63   10.2
## 4  10.5     72   16.4
## 5  10.7     81   18.8
## 6  10.8     83   19.7
Customizing Output with head()
The n argument in head() controls how many values (or rows) to output.
▶ Default: n = 6, so head() outputs the first six values.
▶ Negative n: Returns all values except the last n values.

head(trees, n = 3) # First 3 rows of `trees`

##   Girth Height Volume
## 1   8.3     70   10.3
## 2   8.6     65   10.3
## 3   8.8     63   10.2

head(1:20, n = -8) # Returns all values except the last 8

## [1] 1 2 3 4 5 6 7 8 9 10 11 12
The tail() Function
The tail() function is similar to head() but outputs the last few values (or rows).
▶ Default: Returns the last 6 values or rows.
▶ A positive n argument returns the last n values, while a negative n argument
excludes the first n values.
tail(geyser) # Shows the last 6 rows of `geyser`

## waiting duration
## 294 87 2.133333
## 295 52 4.083333
## 296 85 2.066667
## 297 58 4.000000
## 298 88 4.000000
## 299 79 2.000000

tail(1:20, n = -5) # Returns all values except the first 5

## [1] 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
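
A short sketch to connect the positive and negative forms of n for head() and tail(). The vector x is just an example; nothing here goes beyond the behavior described above.

```r
x <- 1:20

head(x, n = 3)    # first three values
tail(x, n = 3)    # last three values

# Negative n says how many values to drop from the opposite end
head(x, n = -17)  # drop the last 17, same as head(x, n = 3)
tail(x, n = -17)  # drop the first 17, same as tail(x, n = 3)
```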
Data Frames

Basic Definitions and Functions
Matrices
▶ Require all values to be of the same type (e.g., all numeric, character, or logical).
▶ This structure can be too restrictive for statistical datasets with mixed types.

Data Frames
A data frame is a more flexible, two-dimensional array:
▶ Each column can be of a different type, supporting both numerical and
categorical data.
▶ Each row typically represents an observation, and each column represents a
variable.

Why Data Frames?
Data frames mirror the layout of a tidy statistical dataset, where each column
corresponds to a variable, making them ideal for handling diverse data types.
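
To see why the single-type restriction of matrices matters, here is a minimal sketch with two made-up vectors (name and height are illustrative names, not objects from the slides). Binding them into a matrix coerces everything to character, while a data frame keeps each column's type.

```r
name   <- c("Leslie", "Ron", "April")
height <- c(62, 71, 66)

mixed_mat <- cbind(name, height)       # matrix: all values become character
mixed_df  <- data.frame(name, height)  # data frame: each column keeps its type

typeof(mixed_mat)  # "character"
str(mixed_df)      # height is still numeric
```

Once the heights are stored as character strings, arithmetic such as mean() no longer works on them without converting back.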
Consider the table of data on the employees at the Pawnee Parks and Recreation
Department, introduced in the previous chapter.

Name     Height (inches)   Weight (pounds)   Income ($/month)
Leslie   62                115               4000
Ron      71                201               (Redacted)
April    66                119               2000

Recall that we stored the numeric values in the table in a matrix. Here it is built
column by column with cbind().
parks_mat <- cbind(c(62, 71, 66), c(115, 201, 119), c(4000, NA, 2000))
rownames(parks_mat) <- c("Leslie", "Ron", "April")
colnames(parks_mat) <- c("Height", "Weight", "Income")
parks_mat

##        Height Weight Income
## Leslie     62    115   4000
## Ron        71    201     NA
## April      66    119   2000
The data.frame() function inputs multiple vectors of the same length and outputs a
data frame with each column corresponding to one of the vectors (in order). We can set
column (variable) names by naming the vectors:

parks_df <- data.frame(
  Name = c("Leslie", "Ron", "April"),
  Height = c(62, 71, 66),
  Weight = c(115, 201, 119),
  Income = c(4000, NA, 2000)
)
parks_df

##     Name Height Weight Income
## 1 Leslie     62    115   4000
## 2    Ron     71    201     NA
## 3  April     66    119   2000

For the parks_df object, the Name variable is a column in the data frame, not the row
names. The Name column has a different type than the other columns.
Matrix to Data frame Coercion
We can also use data.frame() or as.data.frame() to convert (coerce) matrices
into data frames. By converting parks_mat into a data frame, the row and column
names are preserved.

data.frame(parks_mat)

##        Height Weight Income
## Leslie     62    115   4000
## Ron        71    201     NA
## April      66    119   2000

as.data.frame(parks_mat)

##        Height Weight Income
## Leslie     62    115   4000
## Ron        71    201     NA
## April      66    119   2000
Matrix Functionality in Data Frames
Many of the same basic functions for matrices also work for data frames.
▶ The dim() function outputs the dimension of the input data frame.

dim(parks_df)

## [1] 3 4
▶ The rownames(), colnames(), and dimnames() functions return row and
column names.

rownames(parks_df)

## [1] "1" "2" "3"

colnames(parks_df)

## [1] "Name" "Height" "Weight" "Income"


cbind() Data Frames
The cbind() function combines (binds) columns of data frames together. The vectors
or data frames should contain the same number of rows/observations.
Side Note: Recycling with cbind() works differently for data frames than it does for
matrices. Column binding with a vector will recycle the vector only if it can be
recycled a whole number of times. If incomplete recycling would occur, it throws an error.

cbind(parks_df, "Age" = c(34, 49, 20))

##     Name Height Weight Income Age
## 1 Leslie     62    115   4000  34
## 2    Ron     71    201     NA  49
## 3  April     66    119   2000  20

cbind(parks_df, "Age" = c(34, 49))

## Error in data.frame(..., check.names = FALSE): arguments imply differing
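
A small sketch of the recycling rule from the side note, using a cut-down copy of parks_df so that it runs on its own. The Dept column and the object names with_dept and result are made up for illustration.

```r
parks_df <- data.frame(
  Name = c("Leslie", "Ron", "April"),
  Height = c(62, 71, 66)
)

# A length-1 value divides 3 evenly, so it is recycled to every row
with_dept <- cbind(parks_df, Dept = "Parks")
with_dept

# A length-2 vector cannot be recycled a whole number of times over 3 rows;
# tryCatch() captures the error instead of stopping the script
result <- tryCatch(
  cbind(parks_df, Age = c(34, 49)),
  error = function(e) e
)
class(result)
```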


rbind() Data Frames
The rbind() function combines rows of data frames together. Since the values in a row
are allowed to be of different types, added rows are typically other data frames (or lists).
Merging rows from two data frames can get complicated, though, because the names
of the columns in each data frame should correspond to the names in the other.

# Create a data frame with a new observation
ron_dunn <- data.frame(Name = "Ron", Height = 74, Weight = 194, Income = 5000)
rbind(parks_df, ron_dunn)

##     Name Height Weight Income
## 1 Leslie     62    115   4000
## 2    Ron     71    201     NA
## 3  April     66    119   2000
## 4    Ron     74    194   5000

rbind(parks_df, list("Ron", 74, 194, 5000)) # Same thing

##     Name Height Weight Income
## 1 Leslie     62    115   4000
## 2    Ron     71    201     NA
## 3  April     66    119   2000
## 4    Ron     74    194   5000

rbind(parks_df, c("Ron", 74, 194, 5000)) # converts ALL variables in the da

##     Name Height Weight Income
## 1 Leslie     62    115   4000
## 2    Ron     71    201   <NA>
## 3  April     66    119   2000
## 4    Ron     74    194   5000

Question: What is different about the command rbind(parks_df, c("Ron", 74,
194, 5000))?
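
One way to investigate the question is to compare the column classes after each version of the command. This is a sketch for checking your answer, with parks_df rebuilt here so the code is self-contained; with_list and with_vec are illustrative names.

```r
parks_df <- data.frame(
  Name = c("Leslie", "Ron", "April"),
  Height = c(62, 71, 66),
  Weight = c(115, 201, 119),
  Income = c(4000, NA, 2000)
)

with_list <- rbind(parks_df, list("Ron", 74, 194, 5000))
with_vec  <- rbind(parks_df, c("Ron", 74, 194, 5000))

class(with_list$Height)  # "numeric": a list keeps each value's own type
class(with_vec$Height)   # "character": c() coerced the whole new row to character
```

The printed data frames look almost identical, which is why checking the classes (or calling str()) is worthwhile.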
Extracting Data from Data Frames
Using Square Brackets
▶ Data frames, as two-dimensional objects, allow for square bracket indexing with
row and column indices:
▶ Format: [i, j] where i is the row and j is the column.
▶ Logical and named indices will also work just like in matrices.

parks_df[1, ] # Extract the first row

##     Name Height Weight Income
## 1 Leslie     62    115   4000

parks_df[, -1] # Remove the first column

##   Height Weight Income
## 1     62    115   4000
## 2     71    201     NA
## 3     66    119   2000

parks_df[-2, 3] # Remove the second row and extract the third column

## [1] 115 119

parks_df[, "Name"] # Extract the Names column

## [1] "Leslie" "Ron" "April"

parks_df[c(FALSE, FALSE, TRUE), "Income"] # Extract the third entry from th

## [1] 2000

Preserving Data Frame Structure with drop = FALSE
▶ Note: Data frames consist of columns of vectors:
▶ If multiple columns are extracted, the result is still a data frame.
▶ If only one column is extracted, the result is converted to a vector by default.
▶ To preserve the data frame structure for single-column output, set drop =
FALSE in the square brackets (just like we did for matrices).
parks_df[1,] # still a data frame with one row

##     Name Height Weight Income
## 1 Leslie     62    115   4000

parks_df[, "Name"] # simplifies to a vector

## [1] "Leslie" "Ron" "April"


parks_df[, "Name", drop = FALSE] # Keeps the output as a data frame with one column

##     Name
## 1 Leslie
## 2    Ron
## 3  April
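
A quick way to confirm what drop = FALSE changes is to test the result with is.data.frame(). The sketch uses a two-column copy of parks_df; as_vector and as_df are illustrative names.

```r
parks_df <- data.frame(
  Name = c("Leslie", "Ron", "April"),
  Height = c(62, 71, 66)
)

as_vector <- parks_df[, "Height"]                # simplified to a numeric vector
as_df     <- parks_df[, "Height", drop = FALSE]  # stays a data frame

is.data.frame(as_vector)  # FALSE
is.data.frame(as_df)      # TRUE
dim(as_df)                # 3 rows, 1 column
```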
Factor Variables in data.frame()
▶ The stringsAsFactors argument controls whether characters are coerced into
factors:
▶ R version 4.0.0 or later: stringsAsFactors = FALSE by default.
▶ R versions before 4.0.0 (e.g., 3.6.3): stringsAsFactors = TRUE by default.
▶ Best Practice: To ensure consistent behavior, explicitly set stringsAsFactors
= FALSE in data.frame():

data.frame(Name = c("Leslie", "Ron"), stringsAsFactors = FALSE)

## Name
## 1 Leslie
## 2 Ron

Important caution: Carefully check whether a column is stored as a character or a factor. To
reassign a value in a factor column, we need to use the methods that we use for factors.
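
The caution can be made concrete with a small sketch. It forces stringsAsFactors = TRUE to get a factor column, then shows one factor-aware way to assign a new value; the name "Andy" and the objects df and df_bad are made up for illustration.

```r
df <- data.frame(Name = c("Leslie", "Ron"), stringsAsFactors = TRUE)
class(df$Name)  # "factor"

# Assigning a value that is not an existing level produces NA (with a warning)
df_bad <- df
suppressWarnings(df_bad$Name[2] <- "Andy")
df_bad$Name

# One factor-aware approach: add the level first, then assign
levels(df$Name) <- c(levels(df$Name), "Andy")
df$Name[2] <- "Andy"
df$Name
```

With a character column, the plain assignment would have worked directly, which is one reason to check the column type first.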
Double Square Brackets
For data frames the columns can be extracted using double square brackets [[]],
either referring to the column by numeric index or by name.
parks_df[[1]] # Extract the first column (which is Name)

## [1] "Leslie" "Ron" "April"


parks_df[["Height"]] # Extract the Height column

## [1] 62 71 66

parks_df[[3]][1] # Extract the first element of the third column (Weight)

## [1] 115

Heads up:
▶ Data frames are internally stored in R as list objects whose components are the
column vectors.
▶ For lists, the components can be extracted using double square brackets [[]].
Accessing Columns using the $ Operator

For data frames (and lists) where the columns (or components) are typically named,
the $ operator is a good way to extract a single column. The left side of the $ contains
the data frame we want to extract from, and the right side contains the name of the
column to extract.

parks_df$Height # Extract the Height column from parks_df

## [1] 62 71 66

parks_df$Income # Extract the Income column from parks_df

## [1] 4000 NA 2000

Creating Columns using the $ Operator
Note: The $ operator is also able to add a new column (of the same length) to an
existing data frame. This can be an alternative to cbind().
parks_df # Does not have the Age variable

##     Name Height Weight Income
## 1 Leslie     62    115   4000
## 2    Ron     71    201     NA
## 3  April     66    119   2000

parks_df$Age <- c(34, 49, 20) # Add the Age variable to the parks_df object
parks_df

##     Name Height Weight Income Age
## 1 Leslie     62    115   4000  34
## 2    Ron     71    201     NA  49
## 3  April     66    119   2000  20
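
New columns are often computed from existing ones. As a sketch, the code below adds a hypothetical Height_cm column (not part of the original example) by converting inches to centimeters.

```r
parks_df <- data.frame(
  Name = c("Leslie", "Ron", "April"),
  Height = c(62, 71, 66)
)

# 1 inch = 2.54 cm; the right-hand side is a vector of length nrow(parks_df)
parks_df$Height_cm <- parks_df$Height * 2.54
parks_df
```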
Insert: A Nod to Object Oriented Programming

Modes and Classes in R

▶ The class of an object modifies the behavior of functions in R:


▶ Calling plot(parks_df) creates a panel of ncol(parks_df) by ncol(parks_df)
scatterplots.
▶ Calling plot(parks_mat) creates just a scatterplot of the first two columns.
▶ The mode of an object is R’s internal storage method:
▶ For example, data frames are stored as lists, allowing columns of different types.
▶ A matrix is stored as a long vector with all elements of the same type.

Key Takeaway
▶ Class affects display and user experience; mode affects internal storage and
compatibility with certain functions.

Why Modes and Classes Matter
▶ Functions often require inputs of specific classes or modes and may throw errors
otherwise.
▶ Functions may work completely differently based on the class of their input.
▶ Example: $ notation works with data frames but not matrices because data
frames are stored as lists.

parks_df$Name

## [1] "Leslie" "Ron" "April"


# reminder: all infix operators are really just functions
`$`(parks_df, Name)

## [1] "Leslie" "Ron" "April"

# the $ function does not work on objects of class matrix
parks_mat$Name
# The class and mode of a data frame
class(parks_df)

## [1] "data.frame"

mode(parks_df)

## [1] "list"
# The class and mode of a matrix
class(parks_mat)

## [1] "matrix" "array"

mode(parks_mat)

## [1] "numeric"
# The class and mode of a factor
class(factor(parks_df$Name))
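
For reference, a self-contained sketch of what class and mode look like for a factor. The vector f is an example built from the same three names; the outputs noted in the comments follow from how base R stores factors as integer codes.

```r
f <- factor(c("Leslie", "Ron", "April"))

class(f)       # "factor"
mode(f)        # "numeric": factors are stored as integer codes
levels(f)      # the labels, sorted alphabetically by default
as.integer(f)  # the codes, which index into levels(f)
```

This is another case where the class (factor) drives how the object prints, while the mode reflects the underlying storage.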
Lists

Basic Definitions and Functions

▶ A list is an ordered collection of objects.


▶ Lists are possibly the most flexible objects in R.
▶ Each component in a list can be any other object in R, including
▶ vectors
▶ matrices
▶ data frames
▶ functions
▶ other lists

Creating a List using list()
my_list <- list(
1:10,
matrix(1:6, nrow = 2, ncol = 3),
parks_df,
list(1:5, matrix(1:9, nrow = 3, ncol = 3))
)

my_list

## [[1]]
##  [1]  1  2  3  4  5  6  7  8  9 10
##
## [[2]]
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
##
## [[3]]
##     Name Height Weight Income Age
## 1 Leslie     62    115   4000  34
## 2    Ron     71    201     NA  49
## 3  April     66    119   2000  20
##
## [[4]]
## [[4]][[1]]
## [1] 1 2 3 4 5
##
## [[4]][[2]]
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

Vector vs. List

A vector is an ordered collection of values.


▶ In this sense, lists are vectors too, so lists are sometimes called
  ▶ recursive vectors: lists can contain lists of lists of lists of . . .
  ▶ generic vectors: they are not tied to one type, i.e., they can contain any type of
  entry and even mixed entries.
▶ The vector objects we have worked with so far are sometimes called atomic (from
Latin atomus, “indivisible”) vectors, since their components cannot be broken down
into smaller components. Each entry is one character / double / integer that
cannot be subset any further.
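
The distinction can be checked with base R's is.*() predicates. A minimal sketch with two made-up objects, atomic_vec and a_list:

```r
atomic_vec <- c(1, 2, 3)
a_list     <- list(1, "two", c(3, 4))

is.atomic(atomic_vec)  # TRUE
is.list(a_list)        # TRUE
is.vector(a_list)      # TRUE: lists count as (generic) vectors
is.atomic(a_list)      # FALSE

# Mixing types in an atomic vector forces coercion; a list keeps them apart
c(1, "two")            # both entries become character
```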

Vector Functionality in Lists
Since lists are generic vectors, a few of the basic functions that work for vectors also
work for lists. The concatenation function c() for vectors can also be used to
concatenate lists together.
char_vec <- c("Pawnee Rules", "Eagleton Drools")
c(list(char_vec), my_list)

## [[1]]
## [1] "Pawnee Rules" "Eagleton Drools"
##
## [[2]]
## [1] 1 2 3 4 5 6 7 8 9 10
##
## [[3]]
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
##
## [[4]]
## Name Height Weight Income Age
## 1 Leslie 62 115 4000 34
## 2 Ron 71 201 NA 49
The length() of a List

The length() function, applied to a list, will return the number of (top level)
components in the list.

length(my_list)

## [1] 4

Names in Lists
The names() function can be used to assign or return the names of the components in
a list.

names(my_list) <- c("Vector", "Matrix", "Data Frame", "List")


names(my_list)

## [1] "Vector" "Matrix" "Data Frame" "List"


my_list

## $Vector
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $Matrix
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
Note: The names() function can also be used to add names to elements of vectors.
For data frames, names() is interchangeable with colnames().

first_five <- 1:5


names(first_five) <- c("One", "Two", "Three", "Four", "Five")
first_five

##   One   Two Three  Four  Five
##     1     2     3     4     5

names(parks_df) # Same as colnames(parks_df)

## [1] "Name" "Height" "Weight" "Income" "Age"

Using str() to see the List Structure
You can use the str() function to get an overview of a list.

str(my_list)

## List of 4
## $ Vector : int [1:10] 1 2 3 4 5 6 7 8 9 10
## $ Matrix : int [1:2, 1:3] 1 2 3 4 5 6
## $ Data Frame:’data.frame’: 3 obs. of 5 variables:
## ..$ Name : chr [1:3] "Leslie" "Ron" "April"
## ..$ Height: num [1:3] 62 71 66
## ..$ Weight: num [1:3] 115 201 119
## ..$ Income: num [1:3] 4000 NA 2000
## ..$ Age : num [1:3] 34 49 20
## $ List :List of 2
## ..$ : int [1:5] 1 2 3 4 5
## ..$ : int [1:3, 1:3] 1 2 3 4 5 6 7 8 9
Compare This to the print() Output
When you execute the name of an object as an R command by itself, R will
automatically call the print() function on it and show its output.
print(my_list) # same as my_list

## $Vector
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $Matrix
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
##
## $`Data Frame`
##     Name Height Weight Income Age
## 1 Leslie     62    115   4000  34
## 2    Ron     71    201     NA  49
## 3  April     66    119   2000  20
##
## $List
Extracting Data from Lists

The $ Operator for Subsetting Lists

When the components of a list are named, the $ operator can be used to extract a
single component. The left side of the $ contains the list we want to extract from, and
the right side contains the name of the component to extract.

my_list$Vector

## [1] 1 2 3 4 5 6 7 8 9 10

my_list$Matrix

##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

The component name "Data Frame" contains a space, so using the $ with the full
name requires backticks (or quotation marks) around the name.

my_list$`Data Frame` # my_list$"Data Frame" works too

##     Name Height Weight Income Age
## 1 Leslie     62    115   4000  34
## 2    Ron     71    201     NA  49
## 3  April     66    119   2000  20

my_list$List

## [[1]]
## [1] 1 2 3 4 5
##
## [[2]]
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
Partial Matching in $ Operator
The first few letters of the component name can be used, as long as there is no
ambiguity in which component is being referenced.
Since the name of every component of my_list starts with a different letter,
we only need to type the first letter for the $ operator to know which component
to extract.

my_list$V # Vector

## [1] 1 2 3 4 5 6 7 8 9 10

my_list$D # Data Frame

##     Name Height Weight Income Age
## 1 Leslie     62    115   4000  34
## 2    Ron     71    201     NA  49
## 3  April     66    119   2000  20
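
Partial matching is specific to $. As a sketch (using a smaller two-component list so it runs on its own), double square brackets match names exactly unless exact = FALSE is passed, which is an argument of base R's [[ for lists.

```r
my_list <- list(Vector = 1:10, Matrix = matrix(1:6, nrow = 2))

my_list$V                      # partial match: returns the Vector component
my_list[["V"]]                 # NULL: [[ ]] requires the exact name by default
my_list[["V", exact = FALSE]]  # partial matching has to be requested
```

Because a mistyped or shortened name silently returns NULL rather than an error, writing out full component names is the safer habit in scripts.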
Adding List Components using $
Just like for data frames (which are lists), the $ operator is also able to add a new
component to an existing list.

my_list$Function <- mean


names(my_list) # A Function component has been added to the list

## [1] "Vector" "Matrix" "Data Frame" "List" "Function"

# call the Function component with the Vector component as its argument
my_list$Function(my_list$Vector) # compute the mean of the vector

## [1] 5.5

mean(my_list$Vector) # this is the same

## [1] 5.5
Removing List Components using $

To remove a component from a list (or a column from a data frame), set the
component to NULL.

my_list$Function <- NULL


names(my_list)

## [1] "Vector" "Matrix" "Data Frame" "List"

Double Square Brackets for Subsetting Lists
Double square brackets [[]] and the $ operator can be used to extract top level
components from lists (and from objects of classes that are stored as lists, like data frames).

my_list[[1]] # A vector of length 10

## [1] 1 2 3 4 5 6 7 8 9 10

my_list[[2]] # A 2x3 matrix

##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

You cannot use negative indices within double square brackets [[]]:
my_list[[-1]]

## Error in my_list[[-1]]: invalid negative subscript in get1index <real>


Chained Subsetting of Lists

my_list[[2]][1, ] # The first row of the 2x3 matrix

## [1] 1 3 5

Caution: The index inside the double square brackets must be a single positive
numeric value or a single character string naming a component. Double square brackets
cannot be used to extract multiple top level components at a time.

How to Access Lists within Lists

The fourth component of my_list is a list with two components itself:

my_list[[4]] # get the 4th component of the list which is a list itself

## [[1]]
## [1] 1 2 3 4 5
##
## [[2]]
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9

You can access the second component of this sublist using
# get the 4th component of the list,
# then get the 2nd component of the sublist
my_list[[4]][[2]]

##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

You can chain these subsetting operations as much as you like:

# get the 4th component of the list,
# then get the 2nd component of the sublist,
# then get the 3rd row
my_list[[4]][[2]][3,]

## [1] 3 6 9
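
As a side note, base R also accepts a vector inside double square brackets, but it is interpreted as recursive indexing (one index per nesting level), not as a request for several top level components. A sketch with a stand-in for my_list (the third component is replaced by a placeholder string to keep the code short):

```r
my_list <- list(
  1:10,
  matrix(1:6, nrow = 2, ncol = 3),
  "placeholder",
  list(1:5, matrix(1:9, nrow = 3, ncol = 3))
)

# c(4, 2) means: 4th component, then 2nd component inside it
identical(my_list[[c(4, 2)]], my_list[[4]][[2]])  # TRUE

my_list[[c(4, 2)]][3, ]  # third row of the nested 3x3 matrix
```

The chained form my_list[[4]][[2]] is usually easier to read, so the vector form is mainly worth recognizing when you see it in other people's code.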
Single Square Brackets for Lists
▶ Single square brackets [] behave similarly for lists as they do for atomic vectors.
▶ They allow subsetting of multiple components using numeric, character, or
logical indices.
my_list[1] # A list containing the first component

## $Vector
## [1] 1 2 3 4 5 6 7 8 9 10

my_list[1:2] # A list containing the first and second component

## $Vector
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $Matrix
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
Negative Indices in Lists

my_list[-c(3, 4)] # A list excluding the 3rd and 4th components

## $Vector
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $Matrix
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6

Character Indices in Lists
# A list containing the "Vector" and "List" component
my_list[c("Vector", "List")]

## $Vector
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $List
## $List[[1]]
## [1] 1 2 3 4 5
##
## $List[[2]]
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
Logical Indices in Lists

my_list[c(TRUE, FALSE)] # A list containing every other component

## $Vector
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $`Data Frame`
## Name Height Weight Income Age
## 1 Leslie 62 115 4000 34
## 2 Ron 71 201 NA 49
## 3 April 66 119 2000 20

Difference between Single and Double Square Brackets for Lists

Difference between [] and [[]]:


▶ [] outputs a list object.
▶ [[]] outputs the component inside the list.

my_list[1] # Returns a list containing the first element

## $Vector
## [1] 1 2 3 4 5 6 7 8 9 10

my_list[[1]] # Returns the first element itself (not as a list)

## [1] 1 2 3 4 5 6 7 8 9 10
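
The printed output of the two forms looks similar, so it can help to compare their class and length directly. A minimal sketch with a two-component stand-in for my_list:

```r
my_list <- list(Vector = 1:10, Matrix = matrix(1:6, nrow = 2))

class(my_list[1])     # "list": still a list, of length 1
class(my_list[[1]])   # "integer": the component itself
length(my_list[1])    # 1
length(my_list[[1]])  # 10

sum(my_list[[1]])     # works on the integer vector
# sum(my_list[1]) would throw an error, because sum() cannot add up a list
```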

Your turn:
How would you access Ron’s weight in my_list using
▶ numeric indices only?
▶ character indices only?
▶ logical indices only?
No mixing of indices allowed! Also, get Ron’s weight using as many $ operators as
possible.
str(my_list) # this might help

## List of 4
## $ Vector : int [1:10] 1 2 3 4 5 6 7 8 9 10
## $ Matrix : int [1:2, 1:3] 1 2 3 4 5 6
## $ Data Frame:’data.frame’: 3 obs. of 5 variables:
## ..$ Name : chr [1:3] "Leslie" "Ron" "April"
## ..$ Height: num [1:3] 62 71 66
## ..$ Weight: num [1:3] 115 201 119
## ..$ Income: num [1:3] 4000 NA 2000
## ..$ Age : num [1:3] 34 49 20
## $ List :List of 2
Solution:
my_list[[3]][2,3]

## [1] 201

my_list[["Data Frame"]]["2","Weight"]

## [1] 201

my_list[c(FALSE, FALSE, TRUE, FALSE)][[TRUE]][c(FALSE, TRUE, FALSE),c(FALSE

## [1] 201

Notice that this won’t work for logical indices:

my_list[[c(FALSE, FALSE, TRUE, FALSE)]] # which is why we need the awkward

## Error in my_list[[c(FALSE, FALSE, TRUE, FALSE)]]: attempt to select less

Solution ctd:

my_list$`Data Frame`$Weight[2] # use dollar sign for list and data frame in

## [1] 201

Assigned Reading: Chapter 4 of Advanced R
We’ve learned how to index / subset:
▶ vectors
▶ matrices
▶ lists
▶ data frames

using:
▶ logical
▶ numeric
▶ character

indices.
For a compact review, read Chapter 4 (Subsetting) of Hadley Wickham’s Advanced
R book.
Vectorized Functions for Data Frames and Lists

Vectorized Functions for Vectors

▶ Vectorized Functions: Automatically apply a function to individual components
of an object.
▶ For (atomic) vectors, vector arithmetic performs operations
element-by-element.

c(1, 2, 3) + c(4, 5, 6) # Adds corresponding elements

## [1] 5 7 9

Some generic vectorized functions also provide useful summaries for list components
or data frame columns. Object Oriented Programming allows the same function to
behave very differently for objects of different classes, say vectors, matrices, lists, and
data frames.
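
As a small illustration of the same function behaving differently by class, the sketch below calls summary() on a numeric vector and on a factor. The objects x_num and x_fac are made up for this example.

```r
x_num <- c(2, 4, 6)
x_fac <- factor(c("a", "b", "a"))

summary(x_num)  # numeric input: min, quartiles, median, mean, max
summary(x_fac)  # factor input: counts per level
```

The call is identical in both lines; R picks the method to run based on the class of the argument.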

Seeing an Object’s Structure using the str() Function
▶ Purpose: str() provides a compact overview of an object’s internal structure.
▶ Ideal for quickly understanding the composition of complex objects, such as data
frames and nested lists.
▶ Also works for vectors:
str(1:5)

## int [1:5] 1 2 3 4 5

str(c(1, 2, 3, 4, 5))

## num [1:5] 1 2 3 4 5

str(c("This", "is", "text", "!"))

## chr [1:4] "This" "is" "text" "!"

str(TRUE)

## logi TRUE


str() for Data Frames

str() shows the number of observations and variables. Each variable is summarized,
including the data type (e.g., num for numeric) and a preview of values.

str(trees) # Display the structure of the trees object

## ’data.frame’: 31 obs. of 3 variables:


## $ Girth : num 8.3 8.6 8.8 10.5 10.7 10.8 11 11 11.1 11.2 ...
## $ Height: num 70 65 63 72 81 83 66 75 80 75 ...
## $ Volume: num 10.3 10.3 10.2 16.4 18.8 19.7 15.6 18.2 22.6 19.9 ...

str() for Nested Lists
str() shows the length of the list. The structure of each component is shown.

str(my_list)

## List of 4
## $ Vector : int [1:10] 1 2 3 4 5 6 7 8 9 10
## $ Matrix : int [1:2, 1:3] 1 2 3 4 5 6
## $ Data Frame:'data.frame': 3 obs. of 5 variables:
## ..$ Name : chr [1:3] "Leslie" "Ron" "April"
## ..$ Height: num [1:3] 62 71 66
## ..$ Weight: num [1:3] 115 201 119
## ..$ Income: num [1:3] 4000 NA 2000
## ..$ Age : num [1:3] 34 49 20
## $ List :List of 2
## ..$ : int [1:5] 1 2 3 4 5
## ..$ : int [1:3, 1:3] 1 2 3 4 5 6 7 8 9
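
For deeply nested lists the full str() output can get long. A small sketch, using a hypothetical list named nested, of how the max.level argument of str() limits how many levels are displayed:

```r
# 'nested' is a hypothetical list used only for this illustration
nested <- list(
  Vector = 1:10,
  Inner  = list(a = 1:5, b = list(c = "deep", d = matrix(1:9, nrow = 3)))
)

str(nested)                 # shows every level of nesting
str(nested, max.level = 1)  # shows only the top-level components

# Capture the printed lines to compare how much each call displays
full_lines <- capture.output(str(nested))
top_lines  <- capture.output(str(nested, max.level = 1))
```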
Summarizing an R Object using the summary() Function
summary() computes the five-number summary (and the mean) of numeric vectors.

summary(1:10)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.00    3.25    5.50    5.50    7.75   10.00

summary(c(1, 2, 3, 4, 5))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##       1       2       3       3       4       5

It can handle any type of vector:

summary(c(TRUE, TRUE, FALSE, NA, FALSE))

##    Mode   FALSE    TRUE    NA's
## logical       2       2       1
summary() for Data Frames
The summary() function computes:
▶ Summary statistics for numeric columns (e.g., Min, Median, Mean, Max).
▶ Frequencies for factor columns. Character columns are only described by their
length, class, and mode.

summary(trees)

##      Girth           Height       Volume
##  Min.   : 8.30   Min.   :63   Min.   :10.20
##  1st Qu.:11.05   1st Qu.:72   1st Qu.:19.40
##  Median :12.90   Median :76   Median :24.20
##  Mean   :13.25   Mean   :76   Mean   :30.17
##  3rd Qu.:15.25   3rd Qu.:80   3rd Qu.:37.30
##  Max.   :20.60   Max.   :87   Max.   :77.00

summary(parks_df)

## Name Height Weight Income
## Length:3 Min. :62.00 Min. :115 Min. :2000
summary() for Lists

The summary() function reports the length, class, and mode of each component. I
prefer the str() function for getting an overview of lists.

summary(my_list)

##            Length Class      Mode
## Vector     10     -none-     numeric
## Matrix      6     -none-     numeric
## Data Frame  5     data.frame list
## List        2     -none-     list

The apply Family

The apply Family of Functions
The apply family is a powerful feature of R for minimizing the use of loops and
repetitive code: each function applies an operation across the elements of a data
structure.
Common apply Functions:
▶ apply(): Applies a function over rows or columns of matrices or data frames.
▶ lapply(): Applies a function over each element of a list and returns a list.
▶ sapply(): A version of lapply() that tries to combine outputs into a vector or
matrix.
▶ vapply(): Similar to sapply() but you have to specify the output type.
▶ tapply(): Applies a function over subsets of a vector split by factors.
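
A minimal sketch of tapply(), which splits a vector into groups and applies a function to each group. The vectors height and group are hypothetical values chosen for illustration.

```r
# Hypothetical measurements and their group labels
height <- c(62, 71, 66, 70, 64)
group  <- factor(c("A", "B", "A", "B", "A"))

# Mean height within each level of 'group'
group_means <- tapply(X = height, INDEX = group, FUN = mean)
group_means
```

The result has one named entry per factor level, here the mean of the three "A" values and the mean of the two "B" values.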

Other Members (Not Covered):
▶ mapply(), rapply(), eapply()

The apply() Function
Recall that the apply() function is used to apply a function to the rows or columns
(the margins) of matrices or data frames.
# Compute the mean of every column of the trees data frame
apply(X = trees, MARGIN = 2, FUN = mean) # output is named vector

##    Girth   Height   Volume
## 13.24839 76.00000 30.17097
# Compute the mean of every row of the trees data frame
apply(X = trees, MARGIN = 1, FUN = mean) # output is unnamed vector

##  [1] 29.53333 27.96667 27.33333 32.96667 36.83333 37.83333 30.86667 34.7
##  [9] 37.90000 35.36667 38.16667 36.13333 36.26667 34.00000 35.36667 36.3
## [17] 43.90000 42.23333 36.80000 34.23333 42.16667 41.96667 41.60000 42.1
## [25] 45.30000 51.23333 51.73333 52.06667 49.83333 49.66667 61.53333
# Compute the range (min and max) of every column of the trees data frame
apply(X = trees, MARGIN = 2, FUN = range) # output is matrix with column names

##      Girth Height Volume
## [1,]   8.3     63   10.2
## [2,]  20.6     87   77.0

Note: Remember that the output of apply() will be a matrix if the applied function
returns a vector with more than one element.
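
A small sketch of the orientation of that matrix, using a hypothetical 3 x 2 matrix m. Each call of the function fills one column of the result, so applying over rows (MARGIN = 1) returns one column per row of the input.

```r
# Hypothetical matrix with 3 rows and 2 columns
m <- matrix(1:6, nrow = 3)

row_ranges <- apply(m, 1, range)  # one output column per ROW of m
col_ranges <- apply(m, 2, range)  # one output column per COLUMN of m

dim(row_ranges)  # 2 rows (min, max) by 3 columns
dim(col_ranges)  # 2 rows (min, max) by 2 columns
```

If one result per row is wanted as rows, the output can be transposed with t().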

Caution with apply() on Data Frames

▶ apply() is primarily designed for matrices (and higher-dimensional arrays).
▶ Applying apply() to a data frame will first coerce the data frame into a matrix
using as.matrix().
▶ Issue: If columns in the data frame are of different types, coercion can lead to
unexpected results.

Question:
▶ What does apply(parks_df, 2, mean) output?
▶ Why might this not give the intended results?
▶ How can we compute the mean of each numeric column in parks_df using
apply()?
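
One possible way to work through these questions. The data frame below is an assumed reconstruction of parks_df from the str() output in this chapter, and using na.rm = TRUE is a choice made here, not something the slides prescribe.

```r
# Assumed reconstruction of parks_df (values taken from the str() output)
parks_df <- data.frame(
  Name   = c("Leslie", "Ron", "April"),
  Height = c(62, 71, 66),
  Weight = c(115, 201, 119),
  Income = c(4000, NA, 2000),
  Age    = c(34, 49, 20),
  stringsAsFactors = FALSE
)

# The character column forces as.matrix() to produce a character matrix
matrix_type <- typeof(as.matrix(parks_df))

# mean() of character data returns NA with a warning, for every column
all_na <- suppressWarnings(apply(parks_df, 2, mean))

# Select the numeric columns first, then apply over the numeric matrix
is_num    <- sapply(parks_df, is.numeric)
num_means <- apply(parks_df[, is_num], 2, mean, na.rm = TRUE)
num_means
```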

The lapply() Function

The lapply() function is used to apply a function to each component of a list
(lapply is short for “list apply”). The output of lapply() will be a list.
The syntax of lapply() is lapply(X, FUN, ...), where the arguments are:
▶ X: A list
▶ FUN: The function to be applied.
▶ ...: Any optional arguments to be passed to the FUN function.

Note that there is no margin argument like in apply(), as lists have a single index.
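
A short sketch of the ... argument: anything supplied after FUN is handed on to every call of FUN. The list scores is a hypothetical example.

```r
# Hypothetical list with a missing value in one component
scores <- list(a = c(1, 2, 3), b = c(4, NA, 6))

# Without extra arguments, the NA propagates into the mean of b
plain <- lapply(X = scores, FUN = mean)

# na.rm = TRUE is passed through ... to mean() for every component
no_na <- lapply(X = scores, FUN = mean, na.rm = TRUE)
no_na
```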

# Return the length of each component of the my_list list
lapply(X = my_list, FUN = length)

## $Vector
## [1] 10
##
## $Matrix
## [1] 6
##
## $`Data Frame`
## [1] 5
##
## $List
## [1] 2

lapply() for Data Frames
Since data frames are (stored as) lists, lapply() also works for data frames.

# Compute the range (min and max) of every column of the trees data frame
lapply(X = trees, FUN = range)

## $Girth
## [1] 8.3 20.6
##
## $Height
## [1] 63 87
##
## $Volume
## [1] 10.2 77.0

Question: How is apply(trees, 2, range) different from lapply(trees, range)?
Taking Advantage of lapply()'s List Output
The list output from lapply() is particularly useful when the result from each
component may have a different length (or even a different dimension or class).

# Return the positions of all values equal to the median of x
which_median <- function(x) {
  which(x == median(x))
}
lapply(X = trees, FUN = which_median)

## $Girth
## [1] 16 17
##
## $Height
## [1] 12 13
##
## $Volume
## [1] 11
The sapply() Function
▶ sapply() is a wrapper function for lapply():
▶ Internally calls lapply() to apply a function to each component of a list.
▶ Attempts to simplify the output from lapply() whenever possible.

Simplification Rules:
▶ If the result is a list with each component as a scalar, sapply() returns a vector.
▶ If the result is a list with each component as a vector of the same length (greater
than 1), sapply() returns a matrix.
▶ If the components are not of the same length, sapply() will return a list (same
as lapply()).

Key Insight
▶ sapply() simplifies results where possible, making it useful for cleaner and more
interpretable outputs.
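
A small check of the wrapper relationship, using the built-in trees data. With simplify = FALSE, sapply() skips the simplification step and should return the same list as lapply(); for scalar results, the simplified vector should match unlisting the lapply() output.

```r
# Turning simplification off leaves the plain lapply() result
as_list        <- sapply(trees, range, simplify = FALSE)
same_as_lapply <- identical(as_list, lapply(trees, range))

# Scalar results are collapsed into a named vector
simplified     <- sapply(trees, mean)
same_as_unlist <- identical(simplified, unlist(lapply(trees, mean)))

c(same_as_lapply, same_as_unlist)
```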
Comparison of sapply() and lapply()
By using lapply(), we found the length of each component of list my_list. Notice
the difference when using sapply().
sapply(X = my_list, FUN = length) # output is simplified to named vector

##     Vector     Matrix Data Frame       List
##         10          6          5          2

lapply(X = my_list, FUN = length) # list output

## $Vector
## [1] 10
##
## $Matrix
## [1] 6
##
## $`Data Frame`
## [1] 5
##
## $List
## [1] 2
sapply(X = trees, FUN = range) # output is simplified to named matrix

##      Girth Height Volume
## [1,]   8.3     63   10.2
## [2,]  20.6     87   77.0

lapply(X = trees, FUN = range) # list output

## $Girth
## [1] 8.3 20.6
##
## $Height
## [1] 63 87
##
## $Volume
## [1] 10.2 77.0
sapply() vs apply() on Data Frames
sapply(X = trees, FUN = range) and apply(X = trees, MARGIN = 2, FUN =
range) return the same output.

sapply(trees, range)

##      Girth Height Volume
## [1,]   8.3     63   10.2
## [2,]  20.6     87   77.0

apply(trees, 2, range) # same output for an all-numeric data set

##      Girth Height Volume
## [1,]   8.3     63   10.2
## [2,]  20.6     87   77.0
sapply() vs apply() on Data Frames

Reminder: A data frame is stored as a list of column vectors.
▶ sapply() applies the function to each column, treating the data frame as a list.
▶ apply() with MARGIN = 2 applies the function to each column of the matrix
version (as.matrix(trees)).
▶ For the all-numeric trees data set, the output is equivalent.

Key Difference: Data Coercion
▶ apply() coerces a data frame into a matrix (as.matrix()), which may change
data types.
▶ sapply() and lapply() do not coerce columns, preserving their original types.
Question: How is apply(X = parks_df, MARGIN = 2, FUN = mean) different
from sapply(X = parks_df, FUN = mean)?
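
A sketch that makes the coercion visible by asking each function for the class of every column. The data frame is an assumed reconstruction of parks_df from the str() output in this chapter.

```r
# Assumed reconstruction of parks_df (values taken from the str() output)
parks_df <- data.frame(
  Name   = c("Leslie", "Ron", "April"),
  Height = c(62, 71, 66),
  Weight = c(115, 201, 119),
  Income = c(4000, NA, 2000),
  Age    = c(34, 49, 20),
  stringsAsFactors = FALSE
)

# sapply() sees each column with its original type
classes_sapply <- sapply(parks_df, class)

# apply() sees columns of a character matrix
classes_apply <- apply(parks_df, 2, class)

# With types preserved, mean() works for numeric columns without NAs;
# Name (character) and Income (contains NA) still come back as NA
means_sapply <- suppressWarnings(sapply(parks_df, mean))
means_sapply
```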

The vapply() Function

▶ Purpose: Applies a function to each element of an atomic vector or list.
▶ Similar to sapply(), but requires specifying the expected return type with
FUN.VALUE.
▶ FUN.VALUE: Specifies the type and length of the output. This ensures more
predictable outputs compared to sapply().
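
A few illustrative FUN.VALUE templates on the built-in trees data. The template is just an example value of the expected type and length, such as character(1), logical(1), or integer(1).

```r
# One character string per column
col_classes <- vapply(trees, class, FUN.VALUE = character(1))

# One logical value per column: does any value exceed 80?
any_above_80 <- vapply(trees, function(x) any(x > 80), FUN.VALUE = logical(1))

# One integer per column
n_values <- vapply(trees, length, FUN.VALUE = integer(1))

# mean() returns a double, which does not satisfy an integer template;
# tryCatch() turns the resulting error into a short message
mismatch <- tryCatch(
  vapply(trees, mean, FUN.VALUE = integer(1)),
  error = function(e) "type mismatch"
)
```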

Specifying Output Types in vapply()

vapply(X = trees, FUN = mean, FUN.VALUE = numeric(1)) # output is named vector

##    Girth   Height   Volume
## 13.24839 76.00000 30.17097

Since the range() function returns a numeric vector of length 2, we would set
FUN.VALUE = numeric(2).

vapply(trees, range, numeric(2)) # output is matrix with colnames

##      Girth Height Volume
## [1,]   8.3     63   10.2
## [2,]  20.6     87   77.0
Remember that vapply() will throw an error if the value returned by FUN does not
match the type and length specified in FUN.VALUE.

vapply(trees, mean, numeric(2))

## Error in vapply(trees, mean, numeric(2)): values must be length 2,
## but FUN(X[[1]]) result is length 1

Why Use vapply() Over sapply()?

Observation:
▶ sapply() is more flexible and easier to use.
▶ This flexibility is risky when later code relies on a consistent output type and
length.
Benefits of vapply():
▶ Strictness:
▶ The required FUN.VALUE argument ensures the output has the expected structure.
▶ Alerts users to unexpected results by throwing errors if the output type does not
match.
▶ Predictable Output:
▶ Safer for functions with consistent output structures.
▶ Reduces the risk of unforeseen issues in larger scripts or workflows.
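
One commonly cited case where the two functions differ is zero-length input. A minimal sketch, using a hypothetical empty list named empty:

```r
# Hypothetical empty input, e.g. the result of a filter that matched nothing
empty <- list()

s <- sapply(empty, length)               # falls back to an empty list
v <- vapply(empty, length, integer(1))   # still an integer vector, length 0

class(s)
class(v)
```

Code that expects an integer vector keeps working with the vapply() result, while the sapply() result changes type depending on the input.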

Predicting sapply() Outputs is Hard
It can be hard to predict what type or even class your sapply() output will be. This
makes it difficult to build code on top of it.

sapply(X = trees, FUN = median) # output is vector

##  Girth Height Volume
##   12.9   76.0   24.2

sapply(X = trees, FUN = range) # output is matrix

##      Girth Height Volume
## [1,]   8.3     63   10.2
## [2,]  20.6     87   77.0

sapply(X = trees, FUN = which_median) # output is list

## $Girth
## [1] 16 17
##
## $Height
## [1] 12 13
##
## $Volume
## [1] 11

vapply(X = trees, FUN = which_median, numeric(1)) # this will just throw an error

## Error in vapply(X = trees, FUN = which_median, numeric(1)): values must be length 1,
## but FUN(X[[1]]) result is length 2

Optional Reading: Chapter 16 of Roger Peng’s R Programming
for Data Science

We have now covered (almost) the entire apply family of functions:
▶ apply(): Applies a function over rows or columns of matrices or data frames.
▶ lapply(): Applies a function over each element of a list and returns a list.
▶ sapply(): A version of lapply() that tries to combine outputs into a vector or
matrix.
▶ vapply(): Similar to sapply() but you have to specify the output type.
▶ tapply(): Applies a function over subsets of a vector split by factors.

We did not (and will not) cover:
▶ mapply(), rapply(), eapply()

Feel free to read up on the whole family in Roger Peng’s R Programming for Data
Science. (Link to Chapter 16)
