R Course ISLR Basics 2023
Karin Groothuis-Oudshoorn
September 7, 2023
1
Program today
• Introduction R / RStudio
• Working with vectors
• Different object types / classes
• Working with tables
• Making graphs with ggplot2
• Common errors
2
Why R?
More information on R
• https://www.r-project.org (R updates)
• https://rweekly.org (R news and tips & tricks)
• https://www.r-bloggers.com (R blogs)
• https://stackoverflow.com (R questions)
3
History
4
R for Data Science
5
R for Data Science book
6
R and RStudio
7
RStudio
• User friendly
• Projects
• Cheatsheets
• Tab-completion
8
RStudio
9
R and R packages
10
Packages
.libPaths()
11
Installing and using packages
library(mice)
12
Installing packages
13
Most used packages
14
Tidyverse
install.packages("tidyverse")
library(tidyverse)
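A minimal sketch of the "install once, load every session" pattern. The helper `ensure_package` is hypothetical (not part of R or the tidyverse). Note that `install.packages()` needs the package name as a quoted string, whereas `library()` also accepts it unquoted.

```r
# Hypothetical helper: install a package only if it is not yet available.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # pkg must be a character string
  }
  requireNamespace(pkg, quietly = TRUE)  # TRUE when the package can be loaded
}

# "stats" ships with R, so nothing is installed here.
ok <- ensure_package("stats")
ok
```

The same call with "tidyverse" would download the package on first use only.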
15
Base R
Console
• interpreted language
3 + 5
## [1] 8
pi + 3
## [1] 6.141593
exp(1)
## [1] 2.718282
16
Simple objects
a <- 1
2*a
## [1] 2
a + pi
## [1] 4.141593
a <- a + pi
a
## [1] 4.141593
## [1] 2 4 6 8 10 12 14 16 18 20
17
Working environment
18
Objects
19
Naming of objects
20
Functions
a <- c(3,10,14)
sum(a)
## [1] 27
b <- c(3,10,14,NA)
mean(b, na.rm = TRUE)
## [1] 9
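A short sketch of why the `na.rm` argument matters, reusing the vector `b` from this slide. The variable names on the left-hand side are only illustrative.

```r
b <- c(3, 10, 14, NA)

mean_with_na    <- mean(b)                # NA: the missing value propagates
mean_without_na <- mean(b, na.rm = TRUE)  # 9: NA is removed before averaging
n_missing       <- sum(is.na(b))          # 1: number of missing values

mean_with_na
mean_without_na
n_missing
```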
21
Help function
22
Programming with R: code
23
R script
24
Example R-script
library(ggplot2) # package
25
Two common errors
ggplot(temp, aes(x,y))
26
Different kinds of objects
27
Vectors
Numerical vectors
a <- 1
b <- c(100,3,46,-10,pi)
a_int <- c(1L,3L,5L,7L,9L)
row1 <- 1:5
row2 <- seq(from = 1, to = 10, by = 2)
28
Subsetting vectors
b[2]
## [1] 3
b[3:4]
## [1] 46 -10
b[-1]
b[c(1,3)]
## [1] 100 46
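The three indexing styles side by side, as a sketch that reuses `b` from the Vectors slide. The result names are only illustrative.

```r
b <- c(100, 3, 46, -10, pi)

second   <- b[2]       # positive index: keeps element 2
no_first <- b[-1]      # negative index: drops element 1
large    <- b[b > 10]  # logical index: keeps elements where the condition is TRUE

second
no_first
large
```

Logical indexing is what makes filtering on a condition possible without writing a loop.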
29
Arithmetic operations
• Element-wise!
• Adding and subtracting: + and -
• Multiplying and dividing: * and /
• Exponentiation: ^
• Square root: sqrt
• Integer division: %/% and remainder: %%
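A small sketch of integer division and remainder, the two operators in the list that are least familiar. The vector `x` is made up for illustration.

```r
x <- c(7, 8, 9)

quot <- x %/% 2   # integer division, element by element
rem  <- x %% 2    # remainder, element by element

quot
rem
quot * 2 + rem    # gives back the original x
x * c(1, 2, 3)    # element-wise product, not a matrix product
```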
30
Statistical functions
• Sum: sum
• Minimum and maximum: min and max
• Mean and median: mean and median
• Variance and standard deviation: var and sd
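How `var` and `sd` relate, as a sketch on made-up data: `var` divides by n - 1 (sample variance) and `sd` is its square root.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # made-up data
n <- length(x)

var_manual <- sum((x - mean(x))^2) / (n - 1)  # sample variance by hand
sd_manual  <- sqrt(var_manual)                # standard deviation by hand

all.equal(var_manual, var(x))
all.equal(sd_manual, sd(x))
```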
31
Examples:
a + b
a_int * b
sqrt(a_int)
row2^3
sum(row1)
## [1] 15
32
Some more vectors
rep(1:3, each = 2)
## [1] 1 1 2 2 3 3
rep(1:3, times = 2)
## [1] 1 2 3 1 2 3
## [1] 1 5 1 5 1 5 2 2 4 4
length(y)
## [1] 10
33
Vectors: integer, numeric
a <- 1L
typeof(a)
## [1] "integer"
## [1] "integer"
## [1] "double"
class(vec2)
## [1] "numeric"
vec1 + vec2
## [1] 11.500000 6.200000 8.141593 11.000000
34
Character/string vectors
• Only characters
• between " or '
• One or more elements
• Can be made with the function c()
• Use \" for " and \' for '.
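A sketch of the quoting rules above. The strings are made up for illustration.

```r
s1 <- "She said \"hi\""   # escaped double quotes inside double quotes
s2 <- 'It\'s fine'        # escaped single quote inside single quotes
s3 <- "It's fine"         # no escape needed when the other quote type is used

words <- c(s1, s2)        # a character vector with two elements

cat(s1, "\n")             # cat() shows the string without the backslashes
nchar(s1)                 # number of characters; the backslashes are not counted
identical(s2, s3)
```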
35
Working with strings
Package stringr
36
Some functions from stringr
37
Example stringr
library(stringr)
text <- "This is the 1st day of the R course.
This course has in total 5 lectures."
str_length(text)
## [1] 78
str_sub(text,3,10)
## [1] TRUE
## [1] "This is the 2nd day of the R course. \n This course has in total 5 l
38
Logical vectors
Logical vectors:
## [1] 2 1 2 0 1
b[vec7]
39
Logical operations
## [1] FALSE
F | T
## [1] TRUE
any(c(F,T,F))
## [1] TRUE
all(c(F,T,T))
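A sketch of the logical operators on a whole vector, written with TRUE and FALSE spelled out (T and F are ordinary variables that can be overwritten). The vector `v` is made up for illustration.

```r
v <- c(FALSE, TRUE, TRUE)

not_v  <- !v                        # negation, element by element
and_v  <- v & c(TRUE, TRUE, FALSE)  # element-wise AND
or_v   <- v | c(TRUE, FALSE, FALSE) # element-wise OR

any_true <- any(v)   # is at least one element TRUE?
all_true <- all(v)   # are all elements TRUE?
n_true   <- sum(v)   # TRUE counts as 1, FALSE as 0
```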
40
Conditions
sign   meaning
==     equals
<      smaller than (not equal)
>      larger than (not equal)
<=     smaller than or equal
>=     larger than or equal
x <- c(1,2,3, 4, 5, 6)
is_even <- x %% 2 == 0 # c(F,T,F, T, F, T)
is_threefold <- x %% 3 == 0 # c(F,F,T, F, F, T)
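Conditions like these are mostly used to select elements. A sketch that combines the two logical vectors defined above (the result names are only illustrative):

```r
x <- c(1, 2, 3, 4, 5, 6)
is_even      <- x %% 2 == 0
is_threefold <- x %% 3 == 0

both    <- x[is_even & is_threefold]     # even AND a multiple of three
either  <- x[is_even | is_threefold]     # even OR a multiple of three
neither <- x[!(is_even | is_threefold)]  # all remaining elements

both
either
neither
```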
library(lubridate)
date_vec <- ymd(c("2000-9-14","2002-7-3",
"2004-4-14","2004-6-10"))
class(date_vec)
## [1] "Date"
yday(date_vec)
42
Combining different types
## [1] 1 2 0 1
43
Missing values
44
Type conversion
• as.numeric()
• as.character()
• class()
x <- c("1","100","102")
class(x)
as.numeric(x)
as.character(1:5)
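A sketch of what the conversions return, including the case where a string cannot be read as a number. The vector `mixed` is made up for illustration.

```r
x <- c("1", "100", "102")
num <- as.numeric(x)       # character -> numeric
chr <- as.character(1:5)   # integer -> character

# A string that is not a number becomes NA (R also gives a warning,
# which is suppressed here to keep the output clean).
mixed <- c("12", "abc")
bad <- suppressWarnings(as.numeric(mixed))

class(num)
class(chr)
bad
```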
45
Again different classes
library(ggplot2)
# create a table in R and name it "car"
car <- data.frame(
velocity = c(33.0, 33.0, 49.1, 65.2, 78.5, 93.0),
distance = c(4.69, 4.05, 10.3, 22.3, 34.4, 43.5))
class(car)
## [1] "data.frame"
## [1] "lm"
46
Tables
Import data (csv file) into R
library(readr)
data <- read_csv("data/births.csv")
head(data)
## # A tibble: 6 x 7
## provmin urban child_birth age_cat age etnicity pa
## <dbl> <chr> <chr> <chr> <dbl> <chr> <
## 1 68 strong first line child birth, at home 25-29 ~ 26 Dutch
## 2 12 moderate first line child birth, outpat~ 25-29 ~ 29 Dutch
## 3 99 not first line child birth, outpat~ 25-29 ~ 25 Mediter~
## 4 68 moderate during pregnacy referred to sp~ 30-34 ~ 30 Dutch
47
Importing data sets
• package readr
• via the menu (button <Import Dataset> in the Environment pane)
• more general: read_delim
• for SPSS files: read_sav (library haven)
• for Excel files: read_excel (library readxl)
48
Data type columns
• dbl: double
• chr: character
• date: date
• int: integer
• fct: factor (fixed number of levels)
49
Subsetting data
50
Data processing: some dplyr functions
library(dplyr)
## https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html
## best introduction to dplyr
## (grammar for handling tables): https://dplyr.tidyverse.org
sel_data <- filter(data, parity > 10)          ## filter some rows
data1 <- select(data, provmin, age_cat)        ## select columns
data1 <- rename(data, birth = child_birth,     ## rename column names
                provcode = provmin)
data1 <- arrange(data, etnicity, urban)        ## rearrange rows
data1 <- mutate(data, row_id = row_number())   ## add columns to table
51
More data manipulation
52
The pipe operator
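A minimal sketch of what a pipe does. It uses base R's native pipe `|>` (available from R 4.1) so that no package is needed; the tidyverse pipe `%>%` is used the same way in simple cases like this one. The vector `b` is reused from the Functions slide.

```r
b <- c(3, 10, 14, NA)

# Nested calls are read from the inside out.
nested <- round(sqrt(mean(b, na.rm = TRUE)), 1)

# With a pipe, the same steps are read from left to right:
# the result on the left becomes the first argument of the next function.
piped <- b |> mean(na.rm = TRUE) |> sqrt() |> round(1)

identical(nested, piped)
```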
53
Make graphs with ggplot2
Package ggplot2
• Not standard R
• Based on Grammar of Graphics
• Graph = Data + Layout + Coordinate system
• Graph can have more layers
• A layer has aesthetic (aes) properties that are linked to variables
in the data
• Handy cheatsheets:
https://www.rstudio.com/resources/cheatsheets/
54
Scatterplot example code
library(ggplot2)
55
Scatterplot example
[Figure: scatterplot titled "Birth weight and age mother"; y-axis: Birth weight]
56
Scatterplot example code: split according to smoking
library(ggplot2)
57
Scatterplot split according to smoking
[Figure: scatterplot; y-axis: Birth weight; legend: Smoking (No, Yes)]
58
Histogram
library(ggplot2)
ggplot(data = lbw) +
  geom_histogram(aes(x = age, y = after_stat(count), fill = race)) +
  labs(x = "Age mother",
       y = "Number",
       fill = "Race")
59
Histogram
[Figure: histogram; y-axis: Number; fill legend: Race (White, Black, Other)]
60
Boxplot
library(ggplot2)
ggplot(data = lbw) +
  geom_boxplot(aes(x = race, y = age, fill = smoke)) +
  labs(x = "Race mother",
       y = "Age mother",
       fill = "Smoking")
61
Boxplot
[Figure: boxplot; y-axis: Age mother; fill legend: Smoking (No, Yes)]
62
Subplots
library(ggplot2)
ggplot(data = lbw) +
  geom_histogram(aes(x = age, y = after_stat(count))) +
  labs(x = "Age mother",
       y = "Number") +
  facet_wrap( ~ race, nrow = 1)
63
Histogram split by race
[Figure: histograms in three panels (White, Black, Other); x-axis: Age mother; y-axis: Number]
64
Useful packages
• topic DPV:
  • dplyr
  • lubridate
  • readr
  • tidyr
  • ggplot2
  • stringr
• topic DM:
  • caret
  • modelr
  • rpart
  • randomForest
65
General programming advice
66
Common errors
## Error in llm(dist ~ vel, data = temp): could not find function "llm"
ggplot(temp, aes(x,y))
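A sketch that reproduces the "could not find function" error safely and shows the usual fix. The data frame `temp` is made up here; the error means either a typo in the function name (`llm` instead of `lm`) or a package that has not been loaded with `library()`.

```r
temp <- data.frame(vel  = c(33.0, 49.1, 65.2, 78.5),   # made-up values
                   dist = c(4.69, 10.3, 22.3, 34.4))

# tryCatch() keeps the script running and returns the error message as text.
msg <- tryCatch(
  llm(dist ~ vel, data = temp),           # typo: the function llm does not exist
  error = function(e) conditionMessage(e)
)
msg

# Correct spelling: lm() is part of base R (package stats).
fit <- lm(dist ~ vel, data = temp)
class(fit)
```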
67
More errors
car$velocity
## NULL
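One way such a NULL can arise, as a sketch under an assumption (the slide does not show how `car` was created here): the column has a different name than expected. `$` then returns NULL without any error, so the mistake only shows up later.

```r
# Assumption for illustration: the column is called 'speed', not 'velocity'.
car <- data.frame(speed    = c(33.0, 49.1, 65.2),
                  distance = c(4.69, 10.3, 22.3))

missing_col <- car$velocity                 # NULL, and no error or warning
has_col     <- "velocity" %in% names(car)   # check the column names first

# Indexing by name with [ , ] does raise an error for an unknown column.
err <- tryCatch(car[, "velocity"], error = function(e) "error")

names(car)
missing_col
err
```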
library(readr)
data <- read_delim(file = "data.csv", delim = ";",
locale = locale(encoding="ISO-8859-1"),
col_names = TRUE, col_types = NULL)
68
More errors
library(tree)
## (Intercept) velocity
## (Intercept) 7.0036820 -0.10417678
## velocity -0.1041768 0.00177675
## Error in tree(distance ~ velocity, data = car): could not find function "tre
vcov(model2)
70
End