Tackling repetitive tasks with serial
or parallel programming in R
Speaker: Lun-Hsien CHANG
Affiliation: Immunology in Cancer and Infection, QIMR Berghofer
Meeting: R user group meeting #14
Time: 1-2 PM, 28th July 2020
Place: Seminar room, Level 6, Central building, QIMR, Brisbane
It is the central dogma but….
Even if your program is working fine, you may still want to
● Try a faster R package than the current one
● Rewrite code so that it is less error-prone
● Revise code for simplicity and efficiency
Outline
R programming basics
● Syntax forms, data structure, vector, elapsed time
Serial computing
● For loop, vectorised functions, *apply() functions
Parallel computing
● The doParallel, parallel and foreach packages
Compare time performance in serial and parallel computing
Common syntax forms in an R program
# Comments preceded by a hash
# Assign value "A.1" to variable.1
variable.1 <- "A.1"
library(package.A)
# Use function.A from package.A
function.A( argument1=values
,argument2=values
,...)
# Use function.A from package.A
package.A::function.A( argument1=values
,argument2=values
,...)
Data structure in R
What is a vector in R?
A vector is a one-dimensional collection of numbers, characters or logicals; all of its elements share a single type
v1 <- c(1:5)
v1
# [1] 1 2 3 4 5
v2 <- c("a","b","c","d","e")
v2
# [1] "a" "b" "c" "d" "e"
v3 <- c(TRUE, TRUE, FALSE, FALSE, TRUE)
v3
# [1] TRUE TRUE FALSE FALSE TRUE
v4 <- c(1, "a", TRUE, 4, "b")
v4
#[1] "1" "a" "TRUE" "4" "b"
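The v4 example above shows that a vector holds one type only: when types are mixed, R coerces every element to the most flexible type present. A minimal sketch using base R only; the variable names are illustrative and not from the slides.

```r
# Mixed input is coerced to a single type: logical < numeric < character
mixed.1 <- c(1, TRUE, FALSE)   # logicals become numbers
mixed.2 <- c(1, "a", TRUE)     # everything becomes character
class(mixed.1)                 # "numeric"
class(mixed.2)                 # "character"

# A list, unlike a vector, keeps each element's own type
kept <- list(1, "a", TRUE)
kept.classes <- sapply(kept, class)
kept.classes                   # "numeric" "character" "logical"
```

If mixed types must be preserved, a list (or a data frame with one column per type) is usually the better container.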
What is elapsed time in R?
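User time : CPU time spent running your R code, as reported by your operating system (OS)
System time : CPU time spent by the OS on behalf of your R code (e.g. reading files), as reported by your OS
Elapsed time : the amount of time that passes from the start of a program to its finish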
Start.time <- proc.time()
# run some R code
End.time <- proc.time() - Start.time # time used since Start.time
system.time({
# run some R code
})
# user system elapsed
# 0.4 0.1 132.2
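The two timing approaches above can be tried on a toy task. This is a minimal sketch in base R; slow.task() is a hypothetical stand-in for real work and simply sleeps for 0.2 seconds.

```r
# Hypothetical stand-in for real work: wait 0.2 seconds
slow.task <- function() Sys.sleep(0.2)

# 1) Manual timing with proc.time()
start.time <- proc.time()
slow.task()
time.used <- proc.time() - start.time

# 2) The same measurement with system.time()
time.used.2 <- system.time(slow.task())

time.used[["elapsed"]]     # wall-clock seconds, roughly 0.2
time.used.2[["elapsed"]]
```

Because sleeping uses almost no CPU, user and system time stay near zero while elapsed time does not, which is the same pattern as in the slide output above.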
Serial computing in R
What is serial (sequential) computing?
Runs on a single CPU core, solving one task at a time
Ideal for dependent tasks (e.g. Task 2 uses result from task 1)
Run time grows with the number of tasks
[Figure: Task 1 to Task 4 run one after another along the time axis on a single-core processor (CPU)]
R functions that run serial computing
● for loop
● Vectorised functions
○ Most R functions take a vector, usually as their first argument
○ A few R functions accept only a single value (e.g. dir.create() )
● lapply(), sapply() from the apply family
Syntax form of a for loop in R
for (i in 1:10){   # Create a variable i with values 1 to 10
  Command1         # Take each i value and do something using it
  Command2
  ...
}                  # Close the for loop with }
Syntax forms of a for loop in R
# This works
for (i in 1:10){
  Command1
  Command2
  ...
}
# This works too
for(i in c(1:10)){
  Command1
  Command2
  ...
}
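A concrete for loop may help here. The sketch below computes a running total, a dependent task in the sense described earlier: each iteration needs the result of the previous one, so the iterations have to run serially. The input values are made up for illustration.

```r
# Running total: iteration i uses the result of iteration i - 1
values <- c(3, 1, 4, 1, 5)
running.total <- numeric(length(values))   # pre-allocate the result vector
for (i in seq_along(values)) {
  previous <- if (i == 1) 0 else running.total[i - 1]
  running.total[i] <- previous + values[i]
}
running.total
# [1]  3  4  8  9 14
```

Base R already provides cumsum() for this particular task; the loop is shown only to illustrate the syntax.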
Vectorised operations in R
Many operations are vectorised in R, meaning that a single call operates on every
element of a vector at once (element by element, not on multiple CPU cores)
Task : Look up JPG images in 3 folders and get their file paths
dir.1 <- "C:/images"
dir.2 <- "D:/images"
dir.3 <- "E:/images"
list.files(path = c( dir.1, dir.2, dir.3)
,pattern = "\\.jpg$"
,full.names = TRUE )
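To see what vectorisation saves, the sketch below runs the same check twice: once as a single vectorised call and once as an explicit loop. The file names are hypothetical and no files are read from disk.

```r
# Hypothetical file names used only for illustration
file.names <- c("a.jpg", "b.png", "c.jpg")

# Vectorised: one call handles every element
is.jpg <- grepl(pattern = "\\.jpg$", x = file.names)
is.jpg
# [1]  TRUE FALSE  TRUE

# Equivalent for loop: same result, more code
is.jpg.loop <- logical(length(file.names))
for (i in seq_along(file.names)) {
  is.jpg.loop[i] <- grepl("\\.jpg$", file.names[i])
}
identical(is.jpg, is.jpg.loop)
# [1] TRUE
```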
The *apply() functions
● Examples: lapply(X=, FUN=), sapply(X=, FUN=)
● Use them when the function to apply is simple
● Misconception: that they are faster than a for loop. They are internal loops that
apply a function (FUN=) to all the elements of a vector or list (X=), so they are not faster than a for loop!
The *apply() functions
● Task: Create 3 folders under C:/images
# Specify the full path of new folders
new.folder.1 <- "C:/images/JPG"
new.folder.2 <- "C:/images/TIF"
new.folder.3 <- "C:/images/PNG"
# Create new folders using dir.create()
lapply( X= c( new.folder.1
,new.folder.2
,new.folder.3)
,FUN = function(x) dir.create(x, recursive = TRUE))
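The same task can be tried safely in a temporary folder. This is a sketch under the assumption that a throw-away location is acceptable; base.dir is an illustrative name and replaces C:/images from the slide.

```r
# Create 3 folders under a fresh temporary directory (not C:/images)
base.dir <- tempfile("images")
new.folders <- file.path(base.dir, c("JPG", "TIF", "PNG"))

# dir.create() handles one path at a time, so lapply() loops over the paths
created <- lapply(X = new.folders,
                  FUN = function(x) dir.create(x, recursive = TRUE))

# lapply() returns a list; unlist() turns it into a logical vector
unlist(created)
all(dir.exists(new.folders))
```

sapply() would return the logical vector directly, at the cost of a less predictable return type when the results differ in length.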
An unnecessary usage of lapply()
Task: check to see if 3 image folders exist
# Check the existence of 3 image folders by lapply()
unlist(lapply(X=c( new.folder.1
,new.folder.2
,new.folder.3)
,FUN = function(x) dir.exists(x))
)
# [1] TRUE TRUE TRUE
# By vectorised operation
dir.exists(paths = c( new.folder.1
,new.folder.2
,new.folder.3)
)
# [1] TRUE TRUE TRUE
Task: read multiple text files to a single data frame with lapply()
1. Specify paths of input folders
2. Check to see if these input folders exist
3. Get full paths of input text files
4. Read these files to a list
5. Concatenate these files as a single data frame
Read multiple text files to a single data frame (1/4)
# Specify full paths of data folders
drive.dir.C <- 'C:/Lab_MarkS'
input.data.dir <- file.path(drive.dir.C,"lunC/Immunohistochemistry_images/data_output")
input.data.folder.1 <- file.path(input.data.dir, "MT_Exp023.2_18-001-A","analysis-results")
input.data.folder.2 <- file.path(input.data.dir, "MT_Exp023.2_18-001-B","analysis-results")
input.data.folder.3 <- file.path(input.data.dir, "MT_Exp023.2_18-001-C","analysis-results")
input.data.folder.4 <- file.path(input.data.dir, "MT_Exp023.2_18-002-A","analysis-results")
input.data.folder.5 <- file.path(input.data.dir, "MT_Exp023.2_18-002-B","analysis-results")
Read multiple text files to a single data frame (2/4)
# Check if input folders exist
dir.exists(paths=c( input.data.folder.1
,input.data.folder.2
,input.data.folder.3
,input.data.folder.4
,input.data.folder.5))
Read multiple text files to a single data frame (3/4)
# Get full paths of input files
input.data.file.paths <- list.files(path = c( input.data.folder.1
,input.data.folder.2
,input.data.folder.3
,input.data.folder.4
,input.data.folder.5)
,pattern = ".*cell-segmentation-summary_long-format_based-on-merged-cell-seg-file\\.tsv$"
,full.names = TRUE )
# length(input.data.file.paths) 5
Read multiple text files to a single data frame (4/4)
# Read multiple tsv files to a list
input.data.list <- lapply( X=input.data.file.paths
,FUN= function(x) read.delim( file=x
,header = TRUE
,stringsAsFactors = FALSE)
)
# class(input.data.list) "list"
# length(input.data.list) 5
# Combine list elements to a single data.frame
input.data.read <- do.call(what = "rbind", args = input.data.list)
# dim(input.data.read) 375 14
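The four steps above depend on files that exist only on the speaker's computer. The sketch below reproduces the workflow end to end with made-up data written to a temporary folder; the file names and columns are assumptions for illustration.

```r
# Write 3 small tab-separated files to a temporary folder (made-up data)
demo.dir <- tempfile("demo_tsv")
dir.create(demo.dir)
for (i in 1:3) {
  write.table(data.frame(sample = paste0("S", i), cells = c(10, 20) * i),
              file = file.path(demo.dir, paste0("summary_", i, ".tsv")),
              sep = "\t", row.names = FALSE, quote = FALSE)
}

# Get the full paths, read each file into a list, then stack the list
demo.paths <- list.files(path = demo.dir, pattern = "\\.tsv$", full.names = TRUE)
demo.list <- lapply(X = demo.paths,
                    FUN = function(x) read.delim(file = x, header = TRUE,
                                                 stringsAsFactors = FALSE))
demo.data <- do.call(what = "rbind", args = demo.list)
dim(demo.data)
# [1] 6 2
```

do.call("rbind", ...) requires every file to have the same columns, which is the case here by construction.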
Parallel computing
What is parallel computing?
Runs on multiple CPU cores, solving tasks in parallel (simultaneously).
Ideal for independent tasks (i.e. Task 2 does not rely on the result from Task 1)
Run time is usually shorter than in serial computing
[Figure: Task 1 to Task 4 run at the same time along the time axis on a multi-core processor]
Parallelised programming in R
Use it when you run a batch of similar tasks that are independent of each other
● On a supercomputer: call an R script multiple times from a shell script
● On a local computer: use the doParallel, parallel and foreach packages
Syntax forms of multiple CPU cores and 1 foreach loop
# Load required packages into R (Windows users)
library(doParallel)
library(parallel)
library(foreach)
# Find the number of CPU cores in your computer
parallel::detectCores() # 4 cores detected
# Set up a backend: use 2 CPU cores for R and leave the other 2 for
# software running in the background
cluster <- parallel::makeCluster(parallel::detectCores() - 2)
# Register the cluster
doParallel::registerDoParallel(cluster)
# foreach general form: specify arguments in a foreach loop and
# parallelise tasks with %dopar%
foreach( i=1:10
,.combine = 'c',.packages = c("package.A","package.B"))%dopar%{
Command.1
Command.2
... }
# The serial for loop that the foreach loop replaces
for (i in 1:10){
Command1
Command2
...
}
# Stop the cluster when the work is finished
parallel::stopCluster(cluster)
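A small runnable instance of the general form may be useful. This sketch assumes the doParallel package is installed and uses 2 worker processes, an arbitrary number chosen for illustration; squaring the numbers 1 to 10 stands in for real independent tasks.

```r
library(doParallel)   # also attaches foreach and parallel

# Start 2 worker processes (an assumed number; adjust to your machine)
cluster <- parallel::makeCluster(2)
doParallel::registerDoParallel(cluster)

# Each iteration is independent, so iterations can run on different workers;
# .combine = 'c' collects the results into one vector, in input order
squares <- foreach(i = 1:10, .combine = 'c') %dopar% {
  i^2
}
squares
# [1]   1   4   9  16  25  36  49  64  81 100

# Release the workers when done
parallel::stopCluster(cluster)
```

For tasks this small, starting the workers costs more time than the computation itself; parallel execution tends to pay off only when each task takes noticeably longer.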
Syntax form of nested foreach loops
# nested foreach general form
foreach( i=1:10
,.combine = 'rbind')%:%
foreach(j=1:5
,.combine = 'rbind'
,.packages = c("package.A","package.B"))%dopar%{
command.1
command.2
}
%:% nests the inner foreach loop inside the outer one, merging them into a single stream of tasks
%dopar% parallelises the computation
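The nested form can be tried without a cluster. The sketch below assumes the foreach package is installed and uses %do%, which runs serially, in place of %dopar%; the loop sizes and the i * j body are illustrative choices.

```r
library(foreach)

# %do% runs serially, so no backend is needed for this illustration;
# swap in %dopar% after registering a cluster to run it in parallel
products <- foreach(i = 1:3, .combine = 'rbind') %:%
  foreach(j = 1:2, .combine = 'c') %do% {
    i * j
  }
dim(products)
# [1] 3 2
```

The inner loop combines its results with 'c' into one row per value of i, and the outer loop stacks those rows with 'rbind'.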
Serial versus parallel computing
Compare time used by vectorised & parallel computing
The testing tool: a birthday simulator
● A function to estimate the probability that at least 2 people share a
birthday, given n people in the same room
● Called once for each group size from 1 to N, it returns N probabilities
The timing tool: system.time()
Compare time used by vectorised & parallel computing
The birthday simulation function
# Birthday problem simulator
pbirthdaysim <- function(n){
## n: number of people in the room
## ntests: number of simulations whose results are averaged
ntests <- 100000
pop <- 1:365
anydup <- function(i)
any(duplicated(
sample(pop, n, replace=TRUE)))
sum(sapply(seq(ntests), anydup)) / ntests
}
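One way to sanity-check the simulator is to compare it with the closed-form answer from stats::pbirthday() in base R. In this sketch, pbirthdaysim.small() is a copy of the function above with ntests exposed as an argument and reduced to 2000, an assumption made only so that the example runs quickly.

```r
# Same logic as pbirthdaysim(), with a smaller, adjustable ntests
pbirthdaysim.small <- function(n, ntests = 2000) {
  pop <- 1:365
  anydup <- function(i) any(duplicated(sample(pop, n, replace = TRUE)))
  sum(sapply(seq(ntests), anydup)) / ntests
}

set.seed(1)                      # make the simulation reproducible
simulated <- pbirthdaysim.small(23)
exact <- stats::pbirthday(23)    # closed-form probability, about 0.507
c(simulated = simulated, exact = exact)
```

With 23 people the two values should agree to within a few percentage points; increasing ntests narrows the gap at the cost of run time.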
Compare the time used by vectorised and parallel computing
# Run the birthday simulator using lapply()
system.time( ... )
# Run the birthday simulator using sapply()
system.time( ... )
# Run the birthday simulator using a for loop
system.time( ... )
# Run the birthday simulator using 1 CPU core and a foreach loop
system.time( ... )
# Run the birthday simulator using all CPU cores and 1 foreach loop
system.time( ... )
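The comparison outlined above could be run along the following lines. This is a sketch, not the speaker's benchmark: it assumes the doParallel package is installed, uses a reduced simulator (500 simulations per group size), group sizes 1 to 20, and 2 worker processes, all chosen only to keep the run short.

```r
library(doParallel)   # also attaches foreach and parallel

# Reduced simulator so that the comparison finishes quickly (assumed sizes)
pbirthdaysim.small <- function(n, ntests = 500) {
  anydup <- function(i) any(duplicated(sample(1:365, n, replace = TRUE)))
  sum(sapply(seq(ntests), anydup)) / ntests
}
group.sizes <- 1:20

time.lapply <- system.time(lapply(group.sizes, pbirthdaysim.small))
time.sapply <- system.time(sapply(group.sizes, pbirthdaysim.small))
time.for <- system.time({
  res <- numeric(length(group.sizes))
  for (n in group.sizes) res[n] <- pbirthdaysim.small(n)
})

cluster <- parallel::makeCluster(2)          # 2 workers, an assumed number
doParallel::registerDoParallel(cluster)
time.dopar <- system.time(
  foreach(n = group.sizes, .combine = 'c') %dopar% pbirthdaysim.small(n)
)
parallel::stopCluster(cluster)

# Elapsed seconds for each approach
elapsed <- c(lapply = time.lapply[["elapsed"]],
             sapply = time.sapply[["elapsed"]],
             for.loop = time.for[["elapsed"]],
             foreach.dopar = time.dopar[["elapsed"]])
elapsed
```

The numbers depend on the machine and on the workload. With a task this small the parallel run may well be the slowest, because sending work to the workers has a fixed cost.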
Timing serial and parallel programming
Testing conditions:
● Dell E7440 laptop (Intel Core i5-4300U, 2 cores at 1.9 - 2.9 GHz, Haswell)
● 1 million simulations
Function Elapsed time
lapply
sapply
For loop
Foreach + 1 CPU core
Foreach + all CPU cores detected
sessionInfo()
# R version 4.0.0 (2020-04-24)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows 7 x64 (build 7601) Service Pack 1
Don’t hesitate to ask yourself
● What is the time
performance of my
working code?
● Can I replace a loop with
vectorised functions?
● If my computing tasks are
independent, why haven’t
I used multiple CPU cores
and parallelised
computing?
Q & A
My Qs:
How many CPU cores are detected in your computer?
What are the elapsed times running the birthday simulator in your R?
Your Qs?
Serial and parallel processing in real world
https://slideplayer.com/slide/7066858/
What is a CPU core?
A core, or CPU core, is the actual hardware component.
It is the "brain" of a CPU. It receives instructions, and performs calculations, or
operations, to satisfy those instructions. A CPU can have multiple cores.
A processor with two cores is called a dual-core processor; with four cores, a
quad-core; six cores, hexa-core; eight cores, octa-core.
As of 2019, the majority of consumer CPUs feature between 2 and 12 cores.
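The core count reported by R can refer to logical or physical cores. A minimal sketch using the parallel package that ships with R; the variable names are illustrative, and on some platforms the physical count may be reported as NA or equal to the logical count.

```r
# Logical cores (including hyper-threading) versus physical cores
n.logical <- parallel::detectCores(logical = TRUE)
n.physical <- parallel::detectCores(logical = FALSE)
c(logical = n.logical, physical = n.physical)
```

This distinction may explain why 4 cores were detected earlier on a laptop whose processor is listed with 2 cores.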
