Chapter-7-slides (1)
Thomas Maierhofer
Fall 2024
1 / 104
Learning Objectives
2 / 104
Learning Objectives
3 / 104
Using R Packages
4 / 104
Introduction to R Packages
5 / 104
Loading a Package with library()
To use an installed package, load it with the library() function (quotation marks
around the name are optional). Loading a package gives you access to its functions and
data in your current R session.
library(MASS)
library("MASS") # same thing
The library() function will throw an error if you try to load a package that has not
been installed on your computer.
library(whoops)
6 / 104
Viewing Loaded Packages with search()
The search() function outputs R’s current search path, showing which packages
and environments are loaded.
search()
The search path determines the order in which R looks for objects and functions. For
example:
▶ The Global Environment (“.GlobalEnv”) is first in the path, so R checks for
objects there before other packages.
▶ Assigning to pi in .GlobalEnv will mask the (built-in) pi from the base package.
7 / 104
Installing R Packages from CRAN
To install a package from CRAN, call the install.packages() function with the package’s name in quotation marks:
8 / 104
Installing R Packages from GitHub
▶ Many R packages are hosted on GitHub for open-source collaboration, including
experimental or developmental packages not yet on CRAN.
▶ To install packages from GitHub, use the devtools package, which provides tools
for downloading, installing, and managing GitHub packages.
Steps to Install a Package from GitHub
1. First, install devtools from CRAN if you haven’t already:
2. Use the install_github() function from devtools, inputting the username and
package name in the format “username/repository”.
library("devtools")
install_github("tidyverse/ggplot2")
9 / 104
Installing Packages using RStudio
10 / 104
Loading and Accessing Packages in R
11 / 104
Accessing Functions or Data Directly
Use :: to access a function from a package without loading it with library() using
the syntax package_name::function_name(), for example
## [1] 3
base::pi # the original in the base package is still there
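The masking and `::` behavior described above can be sketched as follows. This is a minimal illustration, assuming we deliberately overwrite `pi` in the global environment; the variable names `masked_pi`, `original_pi`, and `restored_pi` are made up for the example.

```r
# Assumption: we mask pi on purpose to illustrate the search path.
pi <- 3                     # creates `pi` in the current (global) environment
masked_pi <- pi             # found first in .GlobalEnv
original_pi <- base::pi     # `::` goes straight to the base package
rm(pi)                      # remove the mask
restored_pi <- pi           # resolves to base::pi again

# `::` also calls a function without attaching its package
med <- stats::median(c(1, 5, 9))
```

Using `package::name` makes it explicit where an object comes from, which helps when two attached packages export the same name.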
13 / 104
Getting Help on Specific Functions
Use ? followed by the function name or use help() to view the documentation for
that function.
R documentation can provide insights on syntax and functionality, though it may not
always be user-friendly.
14 / 104
RStudio Help Tab
15 / 104
Searching for Functions by Topic
16 / 104
Help with a Package in General
Package Documentation
To receive help on a specific package (that is already installed), use the help
argument in the library() function, like in the example below:
library(help = "MASS")
Every function, data set, or other object included in an R package has a help page.
Vignettes
Most packages have a so-called vignette (French for small illustration), which usually
showcases the functionality of a package using small examples that you can follow
along as you write your own code. These tend to be really useful, so look for them on
CRAN / GitHub.
17 / 104
Getting Help in Practice
1. The documentation provided in the R help pages that come with a package
varies in quality, but should be your first stop for help.
2. Check whether there is a vignette, or a companion book or website. Chances
are, there is one, and chances are, they go through a problem that is similar
enough to yours that you can work your way along.
3. Search the web for your problem; someone on Stack Overflow or Stack Exchange might
have already solved it.
4. Try your luck with ChatGPT or GitHub Copilot, but chances are that if the problem is
not solved on the internet already, they don’t know how to do it either.
18 / 104
Back to R Packages
19 / 104
The data() Function
▶ The data() function loads datasets from packages in the search path and saves
a copy to the workspace.
▶ Many datasets are available in the datasets package, which is part of base R.
▶ Since the datasets package contains the trees data set it can be loaded directly
using data():
data(trees) # Load the trees object
ls() # The trees object has been added to the workspace
Question: How can we find out what type of trees were measured for this dataset?
20 / 104
Listing Datasets with data()
▶ The data() function can also list available datasets in a specific package.
▶ Use the package argument to specify the package you want to explore.
Example: Listing Datasets in MASS
The MASS package contains a dataset called geyser. To use it, load the package (if
not already loaded) and then load the dataset.
library(MASS) # Load the MASS package
data(geyser) # Load the geyser dataset
geyser[1:2,] # look at first two observations
## waiting duration
## 1 80 4.016667
## 2 71 2.150000
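The listing use of data() mentioned above might look like the sketch below. It uses the datasets package so that it runs on any R installation; the same call with package = "MASS" should list geyser, assuming MASS is installed. The names `listing` and `dataset_names` are chosen for this example.

```r
# List the datasets shipped with a package without attaching the package.
listing <- data(package = "datasets")

# The result stores one row per dataset; the "Item" column holds the names.
dataset_names <- listing$results[, "Item"]
head(dataset_names)

# Check whether a particular dataset is available
has_trees <- "trees" %in% dataset_names
```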
Question: Which geyser was measured for this dataset? When was this data collected?
21 / 104
The head() Function
▶ For large datasets, printing the entire object is often impractical.
▶ The head() function outputs the first few values of the input object.
▶ For vectors, it returns the first few elements.
▶ For data frames and matrices, it outputs the first few rows.
## [1] 1 2 3 4 5 6 7 8 9 10 11 12
23 / 104
The tail() Function
The tail() function is similar to head() but outputs the last few values (or rows).
▶ Default: Returns the last 6 values or rows.
▶ A positive n argument returns the last n values, while a negative n argument
returns all but the first |n| values.
tail(geyser) # Shows the last 6 rows of `geyser`
## waiting duration
## 294 87 2.133333
## 295 52 4.083333
## 296 85 2.066667
## 297 58 4.000000
## 298 88 4.000000
## 299 79 2.000000
## [1] 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
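The effect of the n argument in head() and tail() can be sketched on a short vector. The vector `x` is an assumption made for this illustration; the calls that produced the vector outputs on the slides are not shown in this copy.

```r
x <- 1:20

head(x)             # first 6 elements (default n = 6)
head(x, n = 3)      # first 3 elements
head(x, n = -8)     # all but the last 8 elements: 1 through 12

tail(x, n = 3)      # last 3 elements
tail(x, n = -5)     # all but the first 5 elements: 6 through 20

head(trees, n = 2)  # for a data frame, n counts rows
```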
24 / 104
Data Frames
25 / 104
Basic Definitions and Functions
Matrices
▶ Require all values to be of the same type (e.g., all numeric, character, or logical).
▶ This structure can be too restrictive for statistical datasets with mixed types.
Data Frames
A data frame is a more flexible, two-dimensional array:
▶ Each column can be of a different type, supporting both numerical and
categorical data.
▶ Each row typically represents an observation, and each column represents a
variable.
Recall that we used the cbind() function to create a matrix of the numeric values in
the table.
parks_mat <- cbind(c(62, 71, 66), c(115, 201, 119), c(4000, NA, 2000))
rownames(parks_mat) <- c("Leslie", "Ron", "April")
colnames(parks_mat) <- c("Height", "Weight", "Income")
parks_mat
For the parks_df object, the Name variable is a column in the data frame, not the row
name. The Name column has a different type than the other columns.
28 / 104
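The code that creates parks_df is not shown in this copy of the slides. A minimal reconstruction, assuming the column names and values that appear in the later str() and print() output (before the Age column is added), could look like this:

```r
# Assumed reconstruction of parks_df from the values printed later on
parks_df <- data.frame(
  Name   = c("Leslie", "Ron", "April"),
  Height = c(62, 71, 66),
  Weight = c(115, 201, 119),
  Income = c(4000, NA, 2000),
  stringsAsFactors = FALSE  # keep Name as character on every R version
)

dim(parks_df)   # 3 rows, 4 columns
str(parks_df)   # Name is chr, the other columns are num
```

Unlike the matrix version, the names are stored in an ordinary column, so the columns can have different types.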
Matrix to Data frame Coercion
We can also use data.frame() or as.data.frame() to convert (coerce) matrices
into data frames. By converting parks_mat into a data frame, the row and column
names are preserved.
data.frame(parks_mat)
as.data.frame(parks_mat)
dim(parks_df)
## [1] 3 4
▶ The rownames(), colnames(), and dimnames() functions return row and
column names.
rownames(parks_df)
colnames(parks_df)
32 / 104
rbind(parks_df, list("Ron", 74, 194, 5000)) # Same thing
## [1] 2000
35 / 104
Preserving Data Frame Structure with drop = FALSE
▶ Note: Data frames consist of columns of vectors:
▶ If multiple columns are extracted, the result is still a data frame.
▶ If only one column is extracted, the result is converted to a vector by default.
▶ To preserve the data frame structure for single-column output, set drop =
FALSE in the square brackets (just like we did for matrices).
parks_df[1,] # still a data frame with one row
## Name
## 1 Leslie
## 2 Ron
## 3 April
36 / 104
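The three cases described on this slide can be compared side by side. This sketch assumes a parks_df with the values shown elsewhere in the slides; the result names are made up for the example.

```r
# Assumed definition of parks_df, matching the values printed in the slides
parks_df <- data.frame(
  Name   = c("Leslie", "Ron", "April"),
  Height = c(62, 71, 66),
  Weight = c(115, 201, 119),
  stringsAsFactors = FALSE
)

one_row <- parks_df[1, ]                # still a data frame (one row)
as_vec  <- parks_df[, 1]                # single column: simplified to a vector
as_df   <- parks_df[, 1, drop = FALSE]  # single column: stays a data frame

class(as_vec)
class(as_df)
```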
Factor Variables in data.frame()
▶ The stringsAsFactors argument controls whether characters are coerced into
factors:
▶ R version 4.0.0 or later: stringsAsFactors = FALSE by default.
▶ R version 3.6.3 or earlier: stringsAsFactors = TRUE by default.
▶ Best Practice: To ensure consistent behavior, explicitly set stringsAsFactors
= FALSE in data.frame():
## Name
## 1 Leslie
## 2 Ron
## [1] 62 71 66
## [1] 115
Heads up:
▶ Data frames are internally stored in R as list objects whose components are the
column vectors.
▶ For lists, the components can be extracted using double square brackets [[]].
38 / 104
Accessing Columns using the $ Operator
For data frames (and lists) where the columns (or components) are typically named,
the $ operator is a good way to extract a single column. The left side of the $ contains
the data frame we want to extract from, and the right side contains the name of the
column to extract.
## [1] 62 71 66
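Because a data frame is stored as a list of columns, `$` and double square brackets reach the same column. A short sketch, assuming a parks_df with the values shown in the slides:

```r
# Assumed definition of parks_df, matching the values printed in the slides
parks_df <- data.frame(
  Name   = c("Leslie", "Ron", "April"),
  Height = c(62, 71, 66),
  stringsAsFactors = FALSE
)

by_dollar <- parks_df$Height        # column by name with $
by_name   <- parks_df[["Height"]]   # column by name with [[ ]]
by_pos    <- parks_df[[2]]          # column by position with [[ ]]
```

All three return the same numeric vector; `$` is usually the most readable choice when the column name is known in advance.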
39 / 104
Creating Columns using the $ Operator
Note: The $ operator is also able to add a new column (of the same length) to an
existing data frame. This can be an alternative to cbind().
parks_df # Does not have the Age variable
parks_df$Age <- c(34, 49, 20) # Add the Age variable to the parks_df object
parks_df
41 / 104
Modes and Classes in R
Key Takeaway
▶ Class affects display and user experience; mode affects internal storage and
compatibility with certain functions.
42 / 104
Why Modes and Classes Matter
▶ Functions often require inputs of specific classes or modes and may throw errors
otherwise.
▶ A function may behave completely differently depending on the class of its input.
▶ Example: $ notation works with data frames but not matrices because data
frames are stored as lists.
parks_df$Name
## [1] "data.frame"
mode(parks_df)
## [1] "list"
# The class and mode of a matrix
class(parks_mat)
mode(parks_mat)
## [1] "numeric"
# The class and mode of a factor
class(factor(parks_df$Name))
44 / 104
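The class and mode comparisons on these slides can be collected in one place. This is a sketch with small stand-in objects (a two-column parks_mat and a factor of names), not the exact objects from the lecture.

```r
# Stand-in objects (assumed, smaller than the ones in the slides)
parks_mat   <- cbind(Height = c(62, 71, 66), Weight = c(115, 201, 119))
parks_df    <- as.data.frame(parks_mat)
name_factor <- factor(c("Leslie", "Ron", "April"))

class(parks_df)     # "data.frame"
mode(parks_df)      # "list"
class(parks_mat)    # "matrix" "array" in R 4.0.0 and later
mode(parks_mat)     # "numeric"
class(name_factor)  # "factor"
mode(name_factor)   # "numeric": factors are stored as integer codes

# `$` works on the data frame (a list) but fails on the matrix
df_col  <- parks_df$Height
mat_col <- tryCatch(parks_mat$Height, error = function(e) "error")
```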
Lists
45 / 104
Basic Definitions and Functions
46 / 104
Creating a List using list()
my_list <- list(
1:10,
matrix(1:6, nrow = 2, ncol = 3),
parks_df,
list(1:5, matrix(1:9, nrow = 3, ncol = 3))
)
my_list
## [[1]]
## [1] 1 2 3 4 5 6 7 8 9 10
##
## [[2]]
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
47 / 104
## [[3]]
## Name Height Weight Income Age
## 1 Leslie 62 115 4000 34
## 2 Ron 71 201 NA 49
## 3 April 66 119 2000 20
##
## [[4]]
## [[4]][[1]]
## [1] 1 2 3 4 5
##
## [[4]][[2]]
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
48 / 104
Vector vs. List
49 / 104
Vector Functionality in Lists
Since lists are generic vectors, a few of the basic functions that work for vectors also
work for lists. The concatenation function c() for vectors can also be used to
concatenate lists together.
char_vec <- c("Pawnee Rules", "Eagleton Drools")
c(list(char_vec), my_list)
## [[1]]
## [1] "Pawnee Rules" "Eagleton Drools"
##
## [[2]]
## [1] 1 2 3 4 5 6 7 8 9 10
##
## [[3]]
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
##
## [[4]]
## Name Height Weight Income Age
## 1 Leslie 62 115 4000 34
## 2 Ron 71 201 NA 49
50 / 104
The length() of a List
The length() function, applied to a list, will return the number of (top level)
components in the list.
length(my_list)
## [1] 4
51 / 104
Names in Lists
The names() function can be used to assign or return the names of the components in
a list.
## $Vector
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $Matrix
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
52 / 104
Note: The names() function can also be used to add names to elements of vectors.
For data frames, names() is interchangeable with colnames().
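Assigning and reading names might look like the sketch below. The two-component list is a shortened stand-in for my_list, and `heights` is a made-up vector for the illustration.

```r
# Shortened stand-in for my_list (assumed for this example)
my_list <- list(1:10, matrix(1:6, nrow = 2, ncol = 3))

names_before <- names(my_list)           # NULL: no names yet
names(my_list) <- c("Vector", "Matrix")  # assign names
names_after <- names(my_list)            # return names

# names() also works on atomic vectors
heights <- c(62, 71, 66)
names(heights) <- c("Leslie", "Ron", "April")
heights["Ron"]

# For data frames, names() and colnames() agree
same_names <- identical(names(trees), colnames(trees))
```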
53 / 104
Using str() to see the List Structure
54 / 104
Using str() to see the List Structure
You can use the str() function to get an overview of a list.
str(my_list)
## List of 4
## $ Vector : int [1:10] 1 2 3 4 5 6 7 8 9 10
## $ Matrix : int [1:2, 1:3] 1 2 3 4 5 6
## $ Data Frame:'data.frame': 3 obs. of 5 variables:
## ..$ Name : chr [1:3] "Leslie" "Ron" "April"
## ..$ Height: num [1:3] 62 71 66
## ..$ Weight: num [1:3] 115 201 119
## ..$ Income: num [1:3] 4000 NA 2000
## ..$ Age : num [1:3] 34 49 20
## $ List :List of 2
## ..$ : int [1:5] 1 2 3 4 5
## ..$ : int [1:3, 1:3] 1 2 3 4 5 6 7 8 9
55 / 104
Compare This to the print() Output
When you execute the name of an object as an R command by itself, R will
automatically call the print() function on it and show its output.
print(my_list) # same as my_list
## $Vector
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $Matrix
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
##
## $`Data Frame`
## Name Height Weight Income Age
## 1 Leslie 62 115 4000 34
## 2 Ron 71 201 NA 49
## 3 April 66 119 2000 20
##
## $List
56 / 104
Extracting Data from Lists
57 / 104
The $ Operator for Subsetting Lists
When the components of a list are named, the $ operator can be used to extract a
single component. The left side of the $ contains the list we want to extract from, and
the right side contains the name of the component to extract.
my_list$Vector
## [1] 1 2 3 4 5 6 7 8 9 10
my_list$Matrix
58 / 104
The component name "Data Frame" contains a space, so using the $ with the full
name requires backticks (or quotation marks) around the name.
my_list$List
## [[1]]
## [1] 1 2 3 4 5
##
## [[2]]
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
59 / 104
Partial Matching in $ Operator
The first few letters of the component name can be used, as long as there is no
ambiguity in which component is being referenced.
Since the name of every component of my_list starts with a different letter,
we only need to type the first letter for the $ operator to know which component
to extract.
my_list$V # Vector
## [1] 1 2 3 4 5 6 7 8 9 10
# call the Function component with the Vector component as its argument
my_list$Function(my_list$Vector) # compute the mean of the vector
## [1] 5.5
## [1] 5.5
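Partial matching, including what happens when a prefix is ambiguous, can be sketched as follows. The step that adds the Function component is not shown in this copy of the slides, so assigning `mean` to it here is an assumption; `ambiguous` is a hypothetical list for the example.

```r
# Shortened stand-in for my_list (assumed for this example)
my_list <- list(Vector = 1:10, Matrix = matrix(1:6, nrow = 2))
my_list$Function <- mean      # assumption: a function stored as a component

first_match <- my_list$V              # unique prefix: returns Vector
mean_value  <- my_list$F(my_list$V)   # call the Function component on Vector

# When a prefix matches more than one name, $ returns NULL
ambiguous <- list(Vector = 1:3, Value = 10)
no_match  <- ambiguous$V
```

Because an ambiguous prefix fails silently with NULL, spelling out the full name is the safer habit in scripts.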
61 / 104
Removing List Components using $
To remove a component from a list (or a column from a data frame), set the
component to NULL.
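A small sketch of removal by assigning NULL, using a stand-in list and a stand-in data frame (both assumed for this example):

```r
# Remove a component from a list
my_list <- list(Vector = 1:10, Matrix = matrix(1:6, nrow = 2), Function = mean)
my_list$Function <- NULL
names(my_list)    # "Vector" "Matrix"

# Remove a column from a data frame the same way
parks_small <- data.frame(Name = c("Leslie", "Ron", "April"),
                          Age  = c(34, 49, 20))
parks_small$Age <- NULL
names(parks_small)  # "Name"
```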
62 / 104
Double Square Brackets for Subsetting Lists
Double square brackets [[]] and the $ operator can be used to extract top level
components from lists (and classes of objects stored as lists, like data frames).
## [1] 1 2 3 4 5 6 7 8 9 10
You cannot use negative indices within double square brackets [[]]:
my_list[[-1]]
## [1] 1 3 5
Caution: The index inside the double square brackets must be a single positive
numeric value or a single character string naming a component. Double square brackets
cannot be used to extract multiple top level components at a time.
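The rules for double square brackets can be checked with a short sketch. The four-component list below is a simplified stand-in for my_list, and the error is caught with tryCatch() so that the script keeps running.

```r
# Simplified stand-in for my_list (assumed for this example)
my_list <- list(
  Vector = 1:10,
  Matrix = matrix(1:6, nrow = 2, ncol = 3),
  Label  = "Pawnee",
  List   = list(1:5, matrix(1:9, nrow = 3, ncol = 3))
)

by_position <- my_list[[1]]         # single positive number
by_name     <- my_list[["Vector"]]  # single character string

# A negative index inside [[ ]] throws an error for this list
negative <- tryCatch(my_list[[-1]], error = function(e) "error")
```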
64 / 104
How to Access Lists within Lists
my_list[[4]] # get the 4th component of the list which is a list itself
## [[1]]
## [1] 1 2 3 4 5
##
## [[2]]
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
65 / 104
You can access the second component of this sublist using
# get the 4th component of the list,
# then get the 2nd component of the sublist
my_list[[4]][[2]]
## [1] 3 6 9
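Chained indexing can go one level further than the sublist. The sketch below uses a stand-in list with a nested List component; the printed `3 6 9` above is consistent with extracting the third row of the nested matrix, although the command that produced it is not shown in this copy, so that reading is an assumption.

```r
# Stand-in for the nested part of my_list (assumed for this example)
my_list <- list(
  Vector = 1:10,
  List   = list(1:5, matrix(1:9, nrow = 3, ncol = 3))
)

sub_list   <- my_list[["List"]]              # the nested list
sub_matrix <- my_list[["List"]][[2]]         # its 2nd component: a 3 x 3 matrix
third_row  <- my_list[["List"]][[2]][3, ]    # 3rd row of that matrix
```

Each pair of brackets works on the result of the previous one, so the chain is read from left to right.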
66 / 104
Single Square Brackets for Lists
▶ Single square brackets [] behave similarly for lists as they do for atomic vectors.
▶ They allow subsetting of multiple components using numeric, character, or
logical indices.
my_list[1] # A list containing the first component
## $Vector
## [1] 1 2 3 4 5 6 7 8 9 10
## $Vector
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $Matrix
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
67 / 104
Negative Indices in Lists
## $Vector
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $Matrix
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
68 / 104
Character Indices in Lists
# A list containing the "Vector" and "List" component
my_list[c("Vector", "List")]
## $Vector
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $List
## $List[[1]]
## [1] 1 2 3 4 5
##
## $List[[2]]
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
69 / 104
Logical Indices in Lists
## $Vector
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $`Data Frame`
## Name Height Weight Income Age
## 1 Leslie 62 115 4000 34
## 2 Ron 71 201 NA 49
## 3 April 66 119 2000 20
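The calls behind the negative and logical index outputs are not shown in this copy. A sketch that would select the same components, assuming a four-component my_list in the order Vector, Matrix, Data Frame, List (with a trimmed stand-in data frame):

```r
# Simplified stand-in for my_list (assumed for this example)
my_list <- list(
  Vector       = 1:10,
  Matrix       = matrix(1:6, nrow = 2, ncol = 3),
  `Data Frame` = data.frame(Name = c("Leslie", "Ron", "April")),
  List         = list(1:5)
)

# Negative indices drop components: everything except the 3rd and 4th
without_last_two <- my_list[-c(3, 4)]

# Logical indices keep the components marked TRUE
picked <- my_list[c(TRUE, FALSE, TRUE, FALSE)]
```

In both cases the result is again a list, even when only one component is left.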
70 / 104
Difference between Single and Double Square Brackets for Lists
## $Vector
## [1] 1 2 3 4 5 6 7 8 9 10
## [1] 1 2 3 4 5 6 7 8 9 10
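The difference comes down to what is returned: single brackets return a (shorter) list, double brackets return the component itself. A minimal sketch with a stand-in list:

```r
# Shortened stand-in for my_list (assumed for this example)
my_list <- list(Vector = 1:10, Matrix = matrix(1:6, nrow = 2))

single <- my_list[1]    # a list of length 1 that contains Vector
double <- my_list[[1]]  # the integer vector itself

class(single)  # "list"
class(double)  # "integer"

# Arithmetic works on the extracted vector, not on the one-component list
doubled <- double * 2
```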
71 / 104
Your turn:
How would you access Ron’s weight in my_list using
▶ numeric indices only.
▶ character indices only.
▶ logical indices only.
No mixing of indices allowed! Also, get Ron’s weight with as many $ operators as
possible.
str(my_list) # this might help
## List of 4
## $ Vector : int [1:10] 1 2 3 4 5 6 7 8 9 10
## $ Matrix : int [1:2, 1:3] 1 2 3 4 5 6
## $ Data Frame:'data.frame': 3 obs. of 5 variables:
## ..$ Name : chr [1:3] "Leslie" "Ron" "April"
## ..$ Height: num [1:3] 62 71 66
## ..$ Weight: num [1:3] 115 201 119
## ..$ Income: num [1:3] 4000 NA 2000
## ..$ Age : num [1:3] 34 49 20
## $ List :List of 2
72 / 104
Solution:
my_list[[3]][2,3]
## [1] 201
my_list[["Data Frame"]]["2","Weight"]
## [1] 201
## [1] 201
my_list$`Data Frame`$Weight[2] # use dollar sign for list and data frame in
## [1] 201
74 / 104
Assigned Reading: Chapter 4 of Advanced R
We’ve learned how to index / subset:
▶ vectors
▶ matrices
▶ lists
▶ data frames
using:
▶ logical
▶ numeric
▶ character
indices.
For a compact review, read Chapter 4 (Subsetting) of Hadley Wickham’s Advanced
R book.
75 / 104
Vectorized Functions for Data Frames and Lists
76 / 104
Vectorized Functions for Vectors
## [1] 5 7 9
Some generic vectorized functions also provide useful summaries for list components
or data frame columns. Object Oriented Programming allows the same function to
behave very differently for objects of different classes, say vectors, matrices, lists, and
data frames.
77 / 104
Seeing an Object’s Structure using the str() Function
▶ Purpose: str() provides a compact overview of an object’s internal structure.
▶ Ideal for quickly understanding the composition of complex objects, such as data
frames and nested lists.
▶ Also works for vectors.
str(1:5)
## int [1:5] 1 2 3 4 5
str(c(1, 2, 3, 4, 5))
## num [1:5] 1 2 3 4 5
str(TRUE)
For data frames, str() shows the number of observations and variables. Each variable is summarized,
including the data type (e.g., num for numeric) and a preview of values.
79 / 104
str() for Nested Lists
str() shows the length of the list. The structure of each component is shown.
str(my_list)
## List of 4
## $ Vector : int [1:10] 1 2 3 4 5 6 7 8 9 10
## $ Matrix : int [1:2, 1:3] 1 2 3 4 5 6
## $ Data Frame:'data.frame': 3 obs. of 5 variables:
## ..$ Name : chr [1:3] "Leslie" "Ron" "April"
## ..$ Height: num [1:3] 62 71 66
## ..$ Weight: num [1:3] 115 201 119
## ..$ Income: num [1:3] 4000 NA 2000
## ..$ Age : num [1:3] 34 49 20
## $ List :List of 2
## ..$ : int [1:5] 1 2 3 4 5
## ..$ : int [1:3, 1:3] 1 2 3 4 5 6 7 8 9
80 / 104
Summarizing an R Object using the summary() Function
summary() computes the five-number summary (and the mean) of numeric vectors.
summary(1:10)
summary(c(1, 2, 3, 4, 5))
summary(trees)
summary(parks_df)
For lists, the summary() function reports the length, class, and mode of each component. I
prefer the str() function to get an overview of lists.
summary(my_list)
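How summary() changes with the class of its input can be sketched as follows. The small list is a stand-in for my_list, and the result names are made up for the example.

```r
# Numeric vector: five-number summary plus the mean
num_summary <- summary(1:10)
names(num_summary)

# Data frame: one summary per column
summary(trees)

# List: a table with the length, class, and mode of each component
small_list   <- list(Vector = 1:10, Matrix = matrix(1:6, nrow = 2))
list_summary <- summary(small_list)
colnames(list_summary)
```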
83 / 104
The apply Family
84 / 104
The apply Family of Functions
The apply family is a powerful feature of R for minimizing the use of loops and
repetitive code: its functions apply an operation across the elements of a data
structure in a single call.
Common apply Functions:
▶ apply(): Applies a function over rows or columns of matrices or data frames.
▶ lapply(): Applies a function over each element of a list and returns a list.
▶ sapply(): A version of lapply() that tries to combine outputs into a vector or
matrix.
▶ vapply(): Similar to sapply() but you have to specify the output type.
▶ tapply(): Applies a function over subsets of a vector split by factors.
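The five functions listed above can be compared on toy inputs. All objects in this sketch (`m`, `parts`, `heights`, `group`) are made up for the illustration.

```r
m     <- matrix(1:6, nrow = 2)       # rows: (1, 3, 5) and (2, 4, 6)
parts <- list(a = 1:3, b = 4:6)

row_means <- apply(m, 1, mean)       # one value per row
l_out <- lapply(parts, mean)         # always a list
s_out <- sapply(parts, mean)         # simplified to a named numeric vector
v_out <- vapply(parts, mean, FUN.VALUE = numeric(1))  # same, but type-checked

# tapply(): group a vector by a second vector, then summarize each group
heights <- c(62, 71, 66, 74)
group   <- c("x", "y", "x", "y")
t_out   <- tapply(heights, group, mean)
```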
85 / 104
The apply() Function
Recall that the apply() function is used to apply a function to the rows or columns
(the margins) of matrices or data frames.
# Compute the mean of every column of the trees data frame
apply(X = trees, MARGIN = 2, FUN = mean) # output is named vector
Note: Remember that the output of apply() will be a matrix if the applied function
returns a vector with more than one element (of the same length for every row or column).
87 / 104
Caution with apply() on Data Frames
Question:
▶ What does apply(parks_df, 2, mean) output?
▶ Why might this not give the intended results?
▶ How can we compute the mean of each numeric column in parks_df using
apply()?
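One way to explore these questions is sketched below, assuming a parks_df with the values shown in the slides. apply() first converts the data frame to a matrix, and the character Name column forces every value to character.

```r
# Assumed definition of parks_df, matching the values printed in the slides
parks_df <- data.frame(
  Name   = c("Leslie", "Ron", "April"),
  Height = c(62, 71, 66),
  Weight = c(115, 201, 119),
  Income = c(4000, NA, 2000),
  stringsAsFactors = FALSE
)

# The character column turns the whole matrix into character,
# so mean() returns NA (with warnings) for every column.
naive <- suppressWarnings(apply(parks_df, 2, mean))

# Drop the non-numeric column first; na.rm = TRUE skips the missing income.
col_means <- apply(parks_df[, -1], 2, mean, na.rm = TRUE)
```

Here `na.rm = TRUE` is passed through apply() to mean(); without it the Income mean would be NA.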
88 / 104
The lapply() Function
Note that there is no margin argument like in apply(), as lists have a single index.
89 / 104
# Return the length of each component in the my_list list
lapply(X = my_list, FUN = length)
## $Vector
## [1] 10
##
## $Matrix
## [1] 6
##
## $`Data Frame`
## [1] 5
##
## $List
## [1] 2
90 / 104
lapply() for Data Frames
Since data frames are (stored as) lists, lapply() also works for data frames.
# Compute the range (min and max) of every column of the trees data frame
lapply(X = trees, FUN = range)
## $Girth
## [1] 8.3 20.6
##
## $Height
## [1] 63 87
##
## $Volume
## [1] 10.2 77.0
## $Girth
## [1] 16 17
##
## $Height
## [1] 12 13
##
## $Volume
## [1] 11
92 / 104
The sapply() Function
▶ sapply() is a wrapper function for lapply():
▶ Internally calls lapply() to apply a function to each component of a list.
▶ Attempts to simplify the output from lapply() whenever possible.
Simplification Rules:
▶ If the result is a list with each component as a scalar, sapply() returns a vector.
▶ If the result is a list with each component as a vector of the same length (greater
than 1), sapply() returns a matrix.
▶ If the components are not of the same length, sapply() will return a list (same
as lapply()).
Key Insight
▶ sapply() simplifies results where possible, making it useful for cleaner and more
interpretable outputs.
93 / 104
Comparison of sapply() and lapply()
By using lapply(), we found the length of each component of list my_list. Notice
the difference when using sapply().
sapply(X = my_list, FUN = length) # output is simplified to named vector
## $Vector
## [1] 10
##
## $Matrix
## [1] 6
##
## $`Data Frame`
## [1] 5
##
## $List
94 / 104
sapply(X = trees, FUN = range) # output is simplified to named matrix
## $Girth
## [1] 8.3 20.6
##
## $Height
## [1] 63 87
##
## $Volume
## [1] 10.2 77.0
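The list-style output printed above is what lapply() returns; the simplified sapply() results can be inspected with the sketch below. The two-component list is a stand-in for my_list.

```r
# Shortened stand-in for my_list (assumed for this example)
small_list <- list(Vector = 1:10, Matrix = matrix(1:6, nrow = 2))

lengths_list <- lapply(small_list, length)  # list with one number per component
lengths_vec  <- sapply(small_list, length)  # simplified to a named integer vector

range_list <- lapply(trees, range)  # list of three length-2 vectors
range_mat  <- sapply(trees, range)  # simplified to a 2 x 3 matrix

dim(range_mat)
colnames(range_mat)
```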
95 / 104
sapply() vs apply() on Data Frames
sapply(X = trees, FUN = range) and apply(X = trees, MARGIN = 2, FUN =
range) return the same output.
sapply(trees, range)
97 / 104
The vapply() Function
98 / 104
Specifying Output Types in vapply()
Since the range() function returns a numeric vector of length 2, we would set
FUN.VALUE = numeric(2).
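A sketch of that call, together with what happens when FUN.VALUE does not match the function’s output (the error is caught so the script keeps running):

```r
# range() returns two numbers per column, so the template is numeric(2)
range_mat <- vapply(trees, range, FUN.VALUE = numeric(2))
range_mat

# A template of the wrong length makes vapply() stop with an error
wrong_template <- tryCatch(
  vapply(trees, range, FUN.VALUE = numeric(1)),
  error = function(e) "error"
)
```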
100 / 104
Why Use vapply() Over sapply()?
Observation:
▶ sapply() is more flexible and easy to use.
▶ This flexibility can be risky when ensuring consistent output type and length.
Benefits of vapply():
▶ Strictness:
▶ The required FUN.VALUE argument ensures the output has the expected structure.
▶ Alerts users to unexpected results by throwing errors if the output type does not
match.
▶ Predictable Output:
▶ Safer for functions with consistent output structures.
▶ Reduces the risk of unforeseen issues in larger scripts or workflows.
101 / 104
Predicting sapply() Outputs is Hard
It can be hard to predict what type or even class your sapply() output will be. This
makes it difficult to build code on top of it.
102 / 104
sapply(X = trees, FUN = which_median) # output is list
## $Girth
## [1] 16 17
##
## $Height
## [1] 12 13
##
## $Volume
## [1] 11
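The definition of which_median() is not shown in this copy of the slides. A hypothetical reconstruction that is consistent with the printed output, assuming it returns every position whose value equals the median, shows why the result is hard to predict:

```r
# Hypothetical reconstruction: positions of all values equal to the median
which_median <- function(x) which(x == median(x))

# Columns with ties at the median return more than one index,
# so sapply() cannot simplify and falls back to a list.
res <- sapply(trees, which_median)
class(res)

# vapply() makes the assumption explicit and fails loudly instead
strict <- tryCatch(
  vapply(trees, which_median, FUN.VALUE = integer(1)),
  error = function(e) "error"
)
```

With sapply() the same line of code could return a vector for one dataset and a list for another; vapply() turns that surprise into an immediate error.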
103 / 104
Optional Reading: Chapter 16 of Roger Peng’s R Programming
for Data Science
104 / 104