R course for Applied Statistical Learning

Karin Groothuis-Oudshoorn
September 7, 2023

Program today

• Introduction R / RStudio
• Working with vectors
• Different object types / classes
• Working with tables
• Making graphs with ggplot2
• Common errors

Why R?

• Specially designed originally for statistics and data analysis


• Open-source
• Active (online) community (e.g. Stack Overflow)
• Packages

More information on R
• https://www.r-project.org (R updates)
• https://rweekly.org (R news and tips & tricks)
• https://www.r-bloggers.com (R blogs)
• https://stackoverflow.com (R questions)

History

• R is an open-source implementation of the S language (Rick Becker, Allan Wilks, John Chambers, Bell Labs, 1976)
• Developed by Ross Ihaka and Robert Gentleman (1991, University of Auckland)
• First official release in 1995
• Managed by the R Development Core Team
• First stable version 29 February 2000

R for Data Science

R for Data Science book

R and RStudio

R: the engine
RStudio: the dashboard

RStudio

• User friendly
• Projects
• Cheatsheets
• Tab-completion

RStudio

R and R packages

R: a new phone
R packages: apps that you can download

Packages

• More than 16000 packages currently available (see https://cran.r-project.org)
• Extra functions, documentation and data to extend standard R
• Installed in library
• Where are they installed?:

.libPaths()

Installing and using packages

1. Install package: only once


2. Loading package: every session

library(mice)

3. Reinstall package: if you update R


4. List with default packages:
## which packages are default loaded?
search()

## [1] ".GlobalEnv" "package:stats" "package:graphics"
## [4] "package:grDevices" "package:utils" "package:datasets"
## [7] "package:methods" "Autoloads" "package:base"
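
The install-once, load-every-session pattern can be sketched as follows. This is only an illustration: the base package stats is used as a stand-in so that nothing is downloaded, and requireNamespace() is used to check whether a package is installed without attaching it.

```r
# Check whether a package is installed before trying to load it.
# "stats" ships with R, so this sketch never triggers a download.
pkg <- "stats"

is_installed <- requireNamespace(pkg, quietly = TRUE)
if (!is_installed) {
  install.packages(pkg)   # only needed once per R installation
}

# Attach the package for the current session (needed every session).
library(pkg, character.only = TRUE)

# The package now appears on the search path.
on_search_path <- paste0("package:", pkg) %in% search()
on_search_path
```

For a real package, replace "stats" with its name, for example "mice".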

Installing packages

The easy way: in the lower right panel of RStudio:
a) Click on the 'Packages' tab
b) Click on 'Install'
c) Type the name of the package under 'Packages'
d) Click on 'Install'
e) Alternatively: use the function install.packages("mice")

Most used packages

Tidyverse

• Developer: Hadley Wickham (of RStudio)


• Collection of packages: dplyr, ggplot2, tibble, readr,
tidyr, purrr, stringr, forcats
• More consistent than standard R
• A good starting point to learn R
• Is not standard R!
• Webpage: http://www.tidyverse.org

install.packages("tidyverse")
library(tidyverse)

Base R
Console

• interpreted language

3 + 5

## [1] 8

pi + 3

## [1] 6.141593

exp(1)

## [1] 2.718282

Simple objects

a <- 1
2*a

## [1] 2

a + pi

## [1] 4.141593

a <- a + pi
a

## [1] 4.141593

a_row <- 1:10


a_row2 <- 2*a_row
a_row2

## [1] 2 4 6 8 10 12 14 16 18 20

Working environment

Objects

• an object contains the result of an assignment
• an object can be used in a new assignment
• you create an object with <- (or with =)
• objects are stored in the global environment (shown in the upper right panel)
• objects in the global environment are deleted when you quit R/RStudio (unless you save them)

Naming of objects

• the name of an object starts with a letter


• the name of an object contains only letters, numbers, _ and .
• small letters and capital letters are NOT the same
• don't use spaces in names
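
A small sketch of these naming rules in base R. The object names total and Total are made up for illustration; make.names() is a base R helper that shows how an invalid name would be repaired into a valid one.

```r
# Names are case sensitive: these are two different objects.
total <- 10
Total <- 20
same_value <- identical(total, Total)   # FALSE

# make.names() turns an invalid name into a valid one.
fixed_space <- make.names("my value")   # a space is not allowed
fixed_digit <- make.names("2nd_try")    # a name cannot start with a digit

fixed_space
fixed_digit
```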

Functions

• a function has a name and arguments


• the arguments are between parentheses ().
• use no space between function name and the parentheses!
• use the help() function for more information on a function

a <- c(3,10,14)
sum(a)

## [1] 27

b <- c(3,10,14,NA)
mean(b, na.rm = T)

## [1] 9

Help function

• see right lower panel


• in the console: help(mean)
• help description has standard form:
• description
• usage
• arguments
• details
• value
• references
• examples

Programming with R: code

• Console vs. script editor


• New command with ; or a new line
• R is case sensitive
• # for comments

R script

• Document with lines of assignments / code


• Extension of the file is .R
• Comments can be written in it starting with a hashtag (#)
• Run the code with the Run-button (in the middle/ above in
RStudio), or:
• Shortcut: ctrl + Enter
• Output will be shown in the console
• Improves reproducibility

Example R-script

library(ggplot2) # package

# create a table in R and name it "car"


car <- data.frame(
velocity = c(33.0, 33.0, 49.1, 65.2, 78.5, 93.0),
distance = c(4.69, 4.05, 10.3, 22.3, 34.4, 43.5))

# plot the data and draw a line


ggplot(data = car, aes(x = velocity, y = distance)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)

model <- lm(distance ~ velocity, data = car )


model

Two common errors

model <- lm(distance ~ velocity, data = temp )

## Error in is.data.frame(data): object ’temp’ not found

ggplot(temp, aes(x,y))

## Error in ggplot(temp, aes(x, y)): could not find function "ggplot"
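
Both kinds of error can be reproduced and inspected safely with tryCatch(). The sketch below is an illustration only: temp_undefined and llm are deliberately undefined names, and conditionMessage() extracts the text of the error.

```r
# Error 1: the data object does not exist (typo, or its code was not run yet).
msg_object <- tryCatch(
  lm(distance ~ velocity, data = temp_undefined),
  error = function(e) conditionMessage(e)
)

# Error 2: the function is unknown, e.g. the package was not loaded
# with library() or the function name is misspelled.
msg_function <- tryCatch(
  llm(distance ~ velocity),
  error = function(e) conditionMessage(e)
)

msg_object
msg_function

# exists() is a quick check before using an object.
exists("temp_undefined")
```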

Different kind of objects

• vector (character, integer, logical, double, date)


• matrix
• table: dataframes / tibbles
• factor variables
• list
• plot
• user-defined objects
• etc.

Vectors
Numerical vectors

• Contains only integers or doubles (numerical values)


• One or more elements
• Can be made with the function c()
• Series from 1 to n: seq(1, n) or 1:n.

a <- 1
b <- c(100,3,46,-10,pi)
a_int <- c(1L,3L,5L,7L,9L)
row1 <- 1:5
row2 <- seq(from = 1, to = 10, by = 2)

Subsetting vectors

• Index starts with 1


• Use of [ and ] brackets
• Result is also vector
b[2]

## [1] 3

b[3:4]

## [1] 46 -10

b[-1]

## [1] 3.000000 46.000000 -10.000000 3.141593

b[c(1,3)]

## [1] 100 46

Arithmetic operations

• Element-wise!
• Adding and subtracting: + and -
• Multiply, divide: * and /
• Exponentiation: ^
• Square root: sqrt
• Integer division: %/% and remainder (modulo): %%
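
The operators above can be tried on a small vector. A minimal sketch with made-up numbers, showing that every operation is applied element by element:

```r
x <- c(17, 20, 9)

quotient  <- x %/% 5            # integer division: 3 4 1
remainder <- x %% 5             # remainder:        2 0 4
squares   <- x^2                # 289 400 81
roots     <- sqrt(c(4, 9, 16))  # 2 3 4

# Quotient and remainder together reconstruct the original values.
reconstructs <- all(quotient * 5 + remainder == x)

# A single number is recycled over all elements of the vector.
shifted <- x + 1                # 18 21 10
```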

Statistical functions

• Sum: sum
• Minimum and maximum: min and max
• Mean and median: mean and median
• Variance and standard deviation: var and sd
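
A short sketch of these functions on a made-up vector. It also checks by hand that var() uses the sample variance (dividing by n - 1) and that sd() is its square root.

```r
v <- c(2, 4, 4, 4, 5, 5, 7, 9)

total   <- sum(v)      # 40
lowest  <- min(v)      # 2
highest <- max(v)      # 9
avg     <- mean(v)     # 5
mid     <- median(v)   # 4.5

# var() divides by n - 1 (sample variance); sd() is its square root.
manual_var  <- sum((v - mean(v))^2) / (length(v) - 1)
var_matches <- isTRUE(all.equal(var(v), manual_var))
sd_matches  <- isTRUE(all.equal(sd(v), sqrt(var(v))))
```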

Examples:

a + b

## [1] 101.000000 4.000000 47.000000 -9.000000 4.141593

a_int * b

## [1] 100.00000 9.00000 230.00000 -70.00000 28.27433

sqrt(a_int)

## [1] 1.000000 1.732051 2.236068 2.645751 3.000000

row2^3

## [1] 1 27 125 343 729

sum(row1)

## [1] 15

Some more vectors

x <- seq(from = 0, to = 1, by = 0.05)


x[1:5]

## [1] 0.00 0.05 0.10 0.15 0.20

rep(1:3, each = 2)

## [1] 1 1 2 2 3 3

rep(1:3, times = 2)

## [1] 1 2 3 1 2 3

y <- c( rep(c(1,5), times = 3), rep(c(2,4), each = 2))


y

## [1] 1 5 1 5 1 5 2 2 4 4

length(y)

## [1] 10

Vectors: integer, numeric

a <- 1L
typeof(a)

## [1] "integer"

vec1 <- c(1L,3L,5L,7L)


typeof(vec1)

## [1] "integer"

vec2 <- c(10.5,3.2,pi,4L)


typeof(vec2)

## [1] "double"

class(vec2)

## [1] "numeric"

vec1 + vec2

## [1] 11.500000 6.200000 8.141593 11.000000
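
The sum above is a double because vec2 is a double. A small sketch of how integers and doubles mix in arithmetic (the variable names are only illustrative):

```r
is.integer(1L)        # TRUE: the L suffix makes an integer
is.integer(1)         # FALSE: plain numbers are doubles

t_int <- typeof(1L + 1L)   # "integer"
t_mix <- typeof(1L + 1)    # "double": the integer is promoted
t_div <- typeof(5L / 2L)   # "double": / always returns a double
half  <- 5L / 2L           # 2.5
whole <- 5L %/% 2L         # 2, and this stays an integer
```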
Character/string vectors

• Only characters
• between " or '
• One or more elements
• Can be made with the function c()
• Use \" for " and \' for '.

vec3 <- c("low","low","high","medium")


vec4 <- c("Jeroen Bosch","van Gogh")
vec5 <- c("'s Hertogenbosch",'\'s-Gravenhage')
vec6 <- c("1","2","3","4")

Working with strings
Package stringr

• Part of the tidyverse (can also be installed separately)


• See chapter 11 of R for Data Science
• More info in: help(package = "stringr")

Some functions from stringr

• str_length, length of a string


• str_c, concatenate strings
• str_trim, remove spaces at the beginning and end of strings
• str_pad, padding of string
• str_sub, select part of string
• str_detect and str_replace, find and replace.

Example stringr

library(stringr)
text <- "This is the 1st day of the R course.
This course has in total 5 lectures."
str_length(text)

## [1] 78

str_sub(text,3,10)

## [1] "is is th"

str_extract(text, pattern = "is the")

## [1] "is the"

str_detect(text, pattern = "course")

## [1] TRUE

str_replace(text, pattern = "1st", replacement = "2nd")

## [1] "This is the 2nd day of the R course. \n This course has in total 5 l
Logical vectors
Logical vectors:

vec7 <- c(TRUE,FALSE,TRUE,FALSE, TRUE)


vec8 <- c(T,T,T,F,F)
vec7 + vec8

## [1] 2 1 2 0 1

b[vec7]

## [1] 100.000000 46.000000 3.141593

Logical operations

• A and B both true: A & B


• A or B true: A | B
• A not true: ! A
• At least one element of A is true: any(A)
• All elements of A are true: all(A)
F & T

## [1] FALSE

F | T

## [1] TRUE

any(c(F,T,F))

## [1] TRUE

all(c(F,T,T))

## [1] FALSE

Conditions

sign    meaning
==      equal to
<       less than (strictly)
>       greater than (strictly)
<=      less than or equal to
>=      greater than or equal to

x <- c(1,2,3, 4, 5, 6)
is_even <- x %% 2 == 0 # c(F,T,F, T, F, T)
is_threefold <- x %% 3 == 0 # c(F,F,T, F, F, T)

is_even & is_threefold # c(F,F,F, F, F, T)


is_even | is_threefold # c(F,T,T, T, F, T)
! is_even # c(T,F,T, F, T, F)

x[x > 3] # [1] 4 5 6
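
Because TRUE counts as 1 and FALSE as 0, conditions can also be used to count and to locate elements. A minimal sketch on the same vector x:

```r
x <- c(1, 2, 3, 4, 5, 6)

n_even      <- sum(x %% 2 == 0)        # number of even values: 3
share_large <- mean(x > 3)             # proportion larger than 3: 0.5
pos_three   <- which(x %% 3 == 0)      # positions of multiples of 3: 3 6
even_large  <- x[x %% 2 == 0 & x > 2]  # combine two conditions: 4 6
```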


Date vectors

library(lubridate)
date_vec <- ymd(c("2000-9-14","2002-7-3",
"2004-4-14","2004-6-10"))
class(date_vec)

## [1] "Date"

yday(date_vec)

## [1] 258 184 105 162
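
Other parts of a date can be extracted in the same way, and dates can be subtracted from each other. A sketch assuming the lubridate package is installed, using the first two dates from the example above:

```r
library(lubridate)

date_vec <- ymd(c("2000-9-14", "2002-7-3"))

years  <- year(date_vec)    # 2000 2002
months <- month(date_vec)   # 9 7
days   <- day(date_vec)     # 14 3

# Subtracting two dates gives a time difference in days.
gap      <- date_vec[2] - date_vec[1]
gap_days <- as.numeric(gap)  # 657
```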

Combining different types

• All elements of a vector are of the same type


• When combining by concatenating:
• Logical values turn into numerical or text
• Numbers turn into text.

vec_com1 <- c(1,3,4,"Hello")


vec_com1

## [1] "1" "3" "4" "Hello"

vec_com2 <- c(c(1,2),c(F,T))


vec_com2

## [1] 1 2 0 1
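
The conversion order (logical, then integer, then double, then character) can be checked with typeof(). A minimal sketch with made-up values:

```r
t1 <- typeof(c(TRUE, 2L))     # "integer":   logical becomes integer
t2 <- typeof(c(2L, 3.5))      # "double":    integer becomes double
t3 <- typeof(c(3.5, "a"))     # "character": numbers become text
mixed <- c(TRUE, "a")         # "TRUE" "a"

# The same rule makes it possible to count TRUE values with sum().
n_true <- sum(c(TRUE, FALSE, TRUE))   # 2
```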

Missing values

• no user-defined missing value


• default missing value is denoted with NA
• is.na() returns TRUE/FALSE element-wise for missing values
• functions may have a special argument for handling missing values, such as na.rm = TRUE

x <- c(1, 2, NA, 4, 5)


is.na(x)
x + 1
sum(x)
sum(x, na.rm = TRUE)
x[is.na(x)] <- 3
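
The same steps with their results written out as comments. Note that the replacement value 3 happens to equal the mean of the observed values here.

```r
x <- c(1, 2, NA, 4, 5)

missing  <- is.na(x)               # FALSE FALSE TRUE FALSE FALSE
plus_one <- x + 1                  # 2 3 NA 5 6: NA propagates
total_na <- sum(x)                 # NA
total    <- sum(x, na.rm = TRUE)   # 12
avg      <- mean(x, na.rm = TRUE)  # 3

# Replace the missing value; afterwards no NA is left.
x[is.na(x)] <- 3
any_left <- anyNA(x)               # FALSE
```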

Type conversion

• as.numeric()
• as.character()
• class()

x <- c("1","100","102")
class(x)
as.numeric(x)
as.character(1:5)
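
A sketch of what these conversions return. The second part uses a made-up vector to show that text which is not a number becomes NA (R also gives a warning, which is suppressed here).

```r
x <- c("1", "100", "102")

cls_before <- class(x)        # "character"
x_num      <- as.numeric(x)   # 1 100 102
cls_after  <- class(x_num)    # "numeric"
total      <- sum(x_num)      # 203

as_text <- as.character(1:5)  # "1" "2" "3" "4" "5"

# Text that cannot be read as a number is converted to NA.
bad <- suppressWarnings(as.numeric(c("3.5", "abc")))   # 3.5 NA
```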

Again different classes

library(ggplot2)
# create a table in R and name it "car"
car <- data.frame(
velocity = c(33.0, 33.0, 49.1, 65.2, 78.5, 93.0),
distance = c(4.69, 4.05, 10.3, 22.3, 34.4, 43.5))
class(car)

## [1] "data.frame"

# plot the data and draw a line


p1 <- ggplot(data = car, aes(x = velocity, y = distance)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
class(p1)

## [1] "gg" "ggplot"

model <- lm(distance ~ velocity, data = car )


class(model)

## [1] "lm"

Tables
Import data (csv file) into R

library(readr)
data <- read_csv("data/births.csv")

## Rows: 49703 Columns: 7


## -- Column specification ----------------------------------------------------
## Delimiter: ","
## chr (4): urban, child_birth, age_cat, etnicity
## dbl (3): provmin, age, parity
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this mes

head(data)

## # A tibble: 6 x 7
## provmin urban child_birth age_cat age etnicity pa
## <dbl> <chr> <chr> <chr> <dbl> <chr> <
## 1 68 strong first line child birth, at home 25-29 ~ 26 Dutch
## 2 12 moderate first line child birth, outpat~ 25-29 ~ 29 Dutch
## 3 99 not first line child birth, outpat~ 25-29 ~ 25 Mediter~
## 4 68 moderate during pregnacy referred to sp~ 30-34 ~ 30 Dutch

Importing data sets

• package readr
• via the menu (Import Dataset button in the Environment pane)
• more general: read_delim
• for SPSS files: read_sav (package haven)
• for Excel files: read_excel (package readxl)
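
As an illustration of read_delim, the sketch below writes a tiny semicolon-separated file to a temporary location and reads it back. The file contents and column names are made up; it assumes readr version 2.0 or later (for the show_col_types argument).

```r
library(readr)

# Write a small hypothetical data file with ";" as delimiter.
path <- tempfile(fileext = ".csv")
writeLines(c("id;city;age",
             "1;Enschede;26",
             "2;Zwolle;31"), path)

# Read it back; delim tells readr how the columns are separated.
dat <- read_delim(path, delim = ";", show_col_types = FALSE)

dim(dat)      # 2 rows, 3 columns
names(dat)
```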

Data type columns

• dbl: double
• chr: character
• date: date
• int: integer
• fct: factor (fixed number of levels)

Subsetting data

• Use of $ for columns


• Select parts of vectors / tables with [ and ]

data$urban # select the column urban from table data


data[,1] # select the first column from table data
data[1,] # select the first row from table data
data[1:2,3:4] # select the first two rows and columns 3 and 4
data[data$parity > 10,] # select rows with parity > 10
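
The same subsetting can be tried on the small car table from the example script, which needs no data file. A sketch in base R; note that with a base data.frame a single selected column is simplified to a vector, whereas a tibble (as returned by read_csv) stays a one-column table.

```r
car <- data.frame(
  velocity = c(33.0, 33.0, 49.1, 65.2, 78.5, 93.0),
  distance = c(4.69, 4.05, 10.3, 22.3, 34.4, 43.5))

vel       <- car$velocity                      # one column as a vector
first_row <- car[1, ]                          # first row, all columns
first_col <- car[, 1]                          # first column
fast      <- car[car$velocity > 60, ]          # rows that meet a condition
fast_dist <- car[car$velocity > 60, "distance"]  # 22.3 34.4 43.5
```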

Data processing: some dplyr functions

library(dplyr)
## https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html
## best introduction to dplyr
## (grammar for handling tables): https://dplyr.tidyverse.org
sel_data <- filter(data, parity > 10) ## filter some rows
data1 <- select(data,provmin, age_cat) ## select columns
data1 <- rename(data,birth = child_birth, ## rename column names
provcode = provmin)
data1 <- arrange(data, etnicity, urban) ## rearrange rows
data1 <- mutate(data, row_id = row_number()) ## add columns to table

More data manipulation

## summarise the mean age per ethnicity group


data_tot <- group_by(data,etnicity) ## analyse by group
data_sum <- summarise(data_tot, ave_age = mean(age)) ## summarise
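
To see what group_by and summarise produce, the sketch below uses a tiny made-up table with the same column names (etnicity, age) instead of the births file. It assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical miniature version of the births data.
births <- data.frame(
  etnicity = c("Dutch", "Dutch", "Other", "Other", "Other"),
  age      = c(26, 30, 25, 29, 33))

by_group <- group_by(births, etnicity)          # mark the groups
res <- summarise(by_group,
                 ave_age = mean(age),           # mean age per group
                 n       = n())                 # number of rows per group
res
```

The result has one row per group: a mean age of 28 for the two "Dutch" rows and 29 for the three "Other" rows.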

The pipe operator

• A way of chaining commands next to each other


• You can read it as and then
• Package magrittr (the pipe is made available when you load dplyr)
library(dplyr)
library(gapminder) # provides the gapminder data set
gapminder %>%
  filter(continent == "Asia") %>%
  summarise(mean_exp = mean(lifeExp))
# without pipe
temp <- filter(gapminder, continent == "Asia")
temp <- summarise(temp, mean_exp = mean(lifeExp))
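
A longer chain shows why the pipe helps readability. The sketch below uses the car table from the earlier example script (so no extra data package is needed) and compares a piped version with the equivalent nested calls; the column name ratio is made up for illustration.

```r
library(dplyr)

car <- data.frame(
  velocity = c(33.0, 33.0, 49.1, 65.2, 78.5, 93.0),
  distance = c(4.69, 4.05, 10.3, 22.3, 34.4, 43.5))

# With the pipe: read from top to bottom as "and then".
piped <- car %>%
  filter(velocity > 40) %>%
  mutate(ratio = distance / velocity) %>%
  arrange(desc(ratio))

# Without the pipe: the same steps, read from the inside out.
nested <- arrange(
  mutate(filter(car, velocity > 40), ratio = distance / velocity),
  desc(ratio))

same_result <- isTRUE(all.equal(piped, nested))
```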

Make graphs with ggplot2
Package ggplot2

• Not standard R
• Based on Grammar of Graphics
• Graph = Data + Layout + Coordinate system
• Graph can have more layers
• A layer maps aesthetic (aes) properties to variables in the data
• Handy cheatsheets:
https://www.rstudio.com/resources/cheatsheets/
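
The layered idea can be made visible by building a plot one step at a time. A sketch using the car table from the earlier example script and assuming ggplot2 is installed; the axis labels are made up.

```r
library(ggplot2)

car <- data.frame(
  velocity = c(33.0, 33.0, 49.1, 65.2, 78.5, 93.0),
  distance = c(4.69, 4.05, 10.3, 22.3, 34.4, 43.5))

# Step 1: data and aesthetic mapping (no layer yet, so an empty plot).
p <- ggplot(data = car, aes(x = velocity, y = distance))

# Step 2: first layer, the points.
p <- p + geom_point()

# Step 3: second layer, a fitted straight line.
p <- p + geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

# Step 4: labels.
p <- p + labs(x = "Velocity", y = "Distance")

n_layers <- length(p$layers)   # 2
p                              # printing the object draws the plot
```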

Scatterplot example code

library(ggplot2)

ggplot(lbw, aes(x = age, y = bwt)) +


geom_point() +
geom_smooth(method = "lm", se = FALSE) +
coord_cartesian(xlim = c(15,46), ylim = c(0, 5500)) +
labs(title = "Birthweight and age mother",
x = "Age mother",
y = "Birth weight")

Scatterplot example

[Figure: scatterplot "Birthweight and age mother" with fitted line; x-axis: Age mother, y-axis: Birth weight]

Scatterplot example code: split according to smoking

• aes(): col, shape, size

library(ggplot2)

ggplot(lbw, aes(x = age, y = bwt)) +


geom_point(aes(col = smoke)) +
geom_smooth(method = "lm", se = FALSE) +
labs( x = "Age mother",
y = "Birth weight",
color = "Smoking")

Scatterplot split according to smoking

[Figure: scatterplot of Birth weight (y-axis) against age of the mother, points colored by Smoking (No / Yes)]

Histogram

library(ggplot2)

ggplot(data = lbw) +
geom_histogram(aes(x = age, y = after_stat(count), fill = race)) +
labs(x = "Age mother",
y = "Number",
fill = "Race")

Histogram

[Figure: histogram of age of the mother; y-axis: Number, bars filled by Race (White / Black / Other)]

Boxplot

library(ggplot2)

ggplot(data = lbw) +
geom_boxplot(aes(x = race, y = age, fill = smoke)) +
labs(x = "Race mother",
y = "Age mother",
fill = "Smoking")

Boxplot

[Figure: boxplots of Age mother (y-axis) per race, filled by Smoking (No / Yes)]

Subplots

library(ggplot2)

ggplot(data = lbw) +
geom_histogram(aes(x = age, y = after_stat(count))) +
labs(x = "Age mother",
y = "Number") +
facet_wrap( ~ race, nrow = 1)

Histogram split by race

[Figure: three histogram panels (White, Black, Other); x-axis: Age mother, y-axis: Number]

Useful packages

• topic DPV:
• dplyr
• lubridate
• readr
• tidyr
• ggplot2
• stringr
• topic DM:
• caret
• modelr
• rpart
• randomForest

General programming advice

• Work in a script file


• One step at a time
• Check result every step (e.g. with head(data))
• Add comments to your code
• Place necessary libraries at the beginning of the code
• If you encounter errors: rerun from the start
• Carefully read the error message
• GIYF: Google is your friend
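
A small script skeleton that follows this advice might look as follows. It is only a sketch in base R, reusing the car table from the earlier example; stopifnot() is one possible way to check a result after each step.

```r
# Libraries go at the top of the script (none are needed for this sketch).

# Step 1: create the data.
car <- data.frame(
  velocity = c(33.0, 33.0, 49.1, 65.2, 78.5, 93.0),
  distance = c(4.69, 4.05, 10.3, 22.3, 34.4, 43.5))

# Step 2: check the result before moving on.
head(car)
str(car)
stopifnot(nrow(car) == 6, !anyNA(car))

# Step 3: fit the model and inspect it.
model <- lm(distance ~ velocity, data = car)
coefs <- coef(model)
coefs
```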

Common errors

model <- lm(dist ~ vel, data = temp )

## Error in is.data.frame(data): object ’temp’ not found

llm(dist ~ vel, data = temp )

## Error in llm(dist ~ vel, data = temp): could not find function "llm"

ggplot(temp, aes(x,y))

## Error in ggplot(temp, aes(x, y)): object ’temp’ not found

More errors

car <- data.frame(


vel = c(33.0, 33.0, 49.1, 65.2, 78.5, 93.0),
dist = c(4.69, 4.05, 10.3, 22.3, 34.4, 43.5))
car[,1:3]

## Error in `[.data.frame`(car, , 1:3): undefined columns selected

car$velocity

## NULL

library(readr)
data <- read_delim(file = "data.csv", delim = ";",
locale = locale(encoding="ISO-8859-1"),
col_names = TRUE, col_types = NULL)

## Error: ’data.csv’ does not exist in current working directory (’C:/Users/Mik

More errors

library(tree)

## Error in library(tree): there is no package called ’tree’

car <- data.frame(


velocity = c(33.0, 33.0, 49.1, 65.2, 78.5, 93.0),
distance = c(4.69, 4.05, 10.3, 22.3, 34.4, 43.5))
model1 <- lm(distance ~ velocity, data = car)
vcov(model1)

## (Intercept) velocity
## (Intercept) 7.0036820 -0.10417678
## velocity -0.1041768 0.00177675

model2 <- tree(distance ~ velocity, data = car)

## Error in tree(distance ~ velocity, data = car): could not find function "tre

vcov(model2)

## Error in vcov(model2): object ’model2’ not found


RStudio Project

• A directory on the hard disk


• Put scripts and data in a project directory
• A project directory is a working directory for R
• RStudio places some standard files in a project directory
• You can make a project directory with File > New Project
• You can open an existing project directory with File > Open
Project
• Project settings are saved in the file <your_project_name>.Rproj.
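
Inside a project, files can be referred to relative to the project directory, which helps avoid the "does not exist in current working directory" error. A sketch in base R; the path data/births.csv follows the earlier import example and is only an assumption about where the file is stored.

```r
# The working directory: in an RStudio project this is the project directory.
wd <- getwd()
wd

# Build a path relative to the working directory.
data_path <- file.path("data", "births.csv")

# Check that the file is there before trying to read it.
if (file.exists(data_path)) {
  message("Found: ", data_path)
} else {
  message("Not found: ", data_path, " (looked in ", wd, ")")
}
```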

End
