
Statistical Computing with R

Masters in Data Science 503 (S10)


Third Batch, SMS, TU, 2024
Shital Bhandary
Associate Professor
Statistics/Bio-statistics, Demography and Public Health Informatics
Patan Academy of Health Sciences, Lalitpur, Nepal
Faculty, Data Analysis and Decision Modeling, MBA, Pokhara University, Nepal
Faculty, FAIMER Fellowship in Health Professions Education, India/USA.
Review Preview (Unit 2, Session 4)
Review:
• Big data in R
• Subsampling
• ff, ffbase, ffbase2 packages
• data.table package
• DBI and dplyr backend for databases
• DBI and dbplyr for databases

Preview:
• Big data in R
• RevoScaleR & RHadoop packages
• Microsoft R Open and R Server
• SparkR and sparklyr packages
• R for Azure SQL and SQL Server
Working with large datasets in R
(Book: Beginning Data Science in R – Thomas Mailund)
• The concept of Big Data refers to very large datasets, sets of
sizes where you need data warehouses to store the data,
where you typically need sophisticated algorithms to handle
the data, and distributed computations to get anywhere with
it.
• How big is “Big data”?
• Dealing with Big Data is also part of data science. Knowing how to work with large datasets, and how to handle data that slows down your analysis, is an important skill for data scientists.
Big Data/Large dataset: Chapter 5
(Book: Beginning Data Science in R – Thomas Mailund)
• If we set the Big Data issue aside, what counts as a large dataset depends very much on what you want to do with the data; it comes down to the complexity of what you are trying to achieve.

• The science of what you can do with data in a given amount of time, or a given amount of space (be it RAM, disk space, or whatever you need), is called complexity theory and is one of the fundamental topics in computer science.
Working with large datasets: Chapter 5
• Subsample your data before you analyze the full dataset.

• You very rarely need to analyze a complete dataset to get at least an idea of how the data behaves.

• Unless you are looking for very rare events, you will get as much of a feeling for the data from looking at a few thousand data points as you would from looking at a few million.
Working with large datasets: Chapter 5
• Here it is important that you pick a random sample.

• Randomizing might remove a subtle signal, but with the power of statistics we can deal with random noise.

• It is much harder to deal with consistent biases we just don't know about.

• This is the same as taking a random sample from the (target) population in research!
You can use the "dplyr" package for sampling:
• iris %>% sample_n(size = 5) #Select random sample of size 5
• iris %>% sample_frac(size = 0.02) #Select 2% random sample

• You need your data in a form that "dplyr" can manipulate, and if the data is too large even to load into R, then you cannot have it in a data frame to sample from to begin with.

• Luckily, dplyr also supports data stored on disk rather than in RAM, through various backends.

• It is, for example, possible to connect a database to dplyr and sample from a large dataset this way.
“dplyr” backend for (relational) database:
Beginning Data Science in R – Thomas Mailund
• These systems require that you set up a server for the data, though, so a simpler solution, if your data is not already stored in a database, is to use SQLite. SQLite works just on your file system but provides a file format and ways of accessing it using SQL queries.
• You can open or create an SQLite file using the src_sqlite() function of the "dplyr" package:
• iris_db <- src_sqlite("iris_db.sqlite3", create = TRUE)
• You can pull out a table using tbl():
• iris_sqlite <- tbl(iris_db, "iris") # This is a direct process too!
• iris_sqlite2 <- tbl(iris_db, sql("SELECT var1, var2, var3 FROM iris")) # Optional
Accessing SQL database in R: https://www.youtube.com/watch?v=Z5LPjh_EkJk
Run query with dplyr:
• Then you can use dplyr functions to make a query to it:
• iris_sqlite %>% group_by(Species) %>%
• summarise(mean.Petal.Length = mean(Petal.Length))
• ## Source: lazy query [?? x 2] (This means data is not in memory!)
• ## Database: sqlite 3.38.2 [C:\Users\Dell\Documents\iris_db.sqlite3]
• ## Species mean.Petal.Length
• ## <chr> <dbl>
• ## 1 setosa 1.462
• ## 2 versicolor 4.260
• ## 3 virginica 5.552
More here: https://caltechlibrary.github.io/data-carpentry-R-ecology-lesson/05-r-and-databases.html and
YouTube video: https://www.youtube.com/watch?v=7o_doY0FFc8
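Because the query above is lazy, dplyr only runs it when the result is needed. A minimal sketch, assuming the iris_sqlite table created above and the dbplyr package installed, of inspecting the generated SQL and pulling the result into memory:

# Show the SQL that dplyr generates for the pipeline
iris_sqlite %>%
  group_by(Species) %>%
  summarise(mean.Petal.Length = mean(Petal.Length)) %>%
  show_query()

# collect() executes the query and returns an ordinary in-memory tibble
iris_means <- iris_sqlite %>%
  group_by(Species) %>%
  summarise(mean.Petal.Length = mean(Petal.Length)) %>%
  collect()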
Alternatively: We can use RSQLite with DBI
https://rsqlite.r-dbi.org/
• install.packages("RSQLite") • dbWriteTable(con, "mtcars",
• library(RSQLite) mtcars)
• library(DBI) • dbListTables(con)
• # Create an ephemeral in-memory • dbListFields(con, "mtcars")
RSQLite database • dbReadTable(con, "mtcars")
• con <- • res <- dbSendQuery(con, "SELECT *
dbConnect(RSQLite::SQLite(), FROM mtcars WHERE cyl = 4")
":memory:") • dbFetch(res)
• dbListTables(con) • dbClearResult(res)
character(0) # 0 tables • dbDisconnect(con)
It is now easy to use "dbplyr" to work with all kinds of databases! See Chapter 21, R4DS, 2nd Edition.
https://r4ds.hadley.nz/databases
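A minimal sketch of the dbplyr workflow described in R4DS Chapter 21: connect with DBI, reference a table with tbl(), and let dbplyr translate dplyr verbs to SQL. The in-memory SQLite connection and the mtcars table simply reuse the example above:

library(DBI)
library(dplyr)
library(dbplyr)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)

mtcars_db <- tbl(con, "mtcars")            # lazy reference to the database table
mtcars_db %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()                                # run the SQL and return a tibble

dbDisconnect(con)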
Problems working with large datasets in R:
Chapter 5
• Running out of memory during analysis.

• R can be very wasteful of RAM, because R remembers more (than is immediately obvious) but shows less in the output.

• In R, all objects are immutable (unless we use a workaround).

• Whenever you modify an object, you are actually creating a new object.
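A minimal sketch (not from the book) of seeing this copy-on-modify behaviour with base R's tracemem(), which is available in standard CRAN builds of R:

d <- data.frame(x = rnorm(1e6))
tracemem(d)          # start tracing copies of d; prints its memory address
d$y <- d$x * 2       # "modifying" d triggers a copy; tracemem() reports it
untracemem(d)
print(object.size(d), units = "MB")   # size of the (new) object in RAM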
Plotting problems with large datasets in R:
Use “hexbin” and/or 2D Density plot!
• library(ggplot2)
• library(hexbin)
• library(dplyr)   # for the %>% pipe
• Too large to plot? Use a hex plot and/or a 2-D density plot:
• d <- data.frame(x = rnorm(10000), y = rnorm(10000))
• d %>% ggplot(aes(x = x, y = y)) + geom_point() # Large/cluttered scatterplot
• d %>% ggplot(aes(x = x, y = y)) + geom_hex() # Requires the "hexbin" package
• d %>% ggplot(aes(x = x, y = y)) + geom_density_2d() # 2-D density plot
• d %>% ggplot(aes(x = x, y = y)) + geom_hex() + scale_fill_gradient(low = "lightgray", high = "red") + geom_density2d(color = "black") # Hex with 2-D density plot
What do these packages do for Big Data?
https://malouche.github.io/bigdata2018/largedata.html
• install.packages("ff") • library(ff)

• install.packages("ffbase") • library(ffbase)

• install.packages("ffbase2") • library(ffbase2)
ff package
https://bookdown.org/josephine_lukito/j381m_tutorials/ff.html
• "ff" is a package that helps you work with larger-than-memory datasets.

• "ff" works by storing your data on disk.

• This is done using flat files (hence the "ff"): ff objects behave like vectors but point to data stored on disk.

• "ffbase" is a helper package that allows you to perform simple functions with ff objects.
Example 2:
• library(ff) # ff = flat files
• ffcars <- as.ffdf(cars)
• summary(ffcars)
• library(ffbase) # Another set of fast methods and models for big data in R!
• library(biglm)  # Provides bigglm(); ffbase supplies the method for ffdf objects
• model <- bigglm(dist ~ speed, data = ffcars)
• summary(model)
• library(ffbase2) # dplyr on ff
• # Available from GitHub: https://github.com/edwindj/ffbase2
• iris_f <- tbl_ffdf(iris)
• cars_f <- tbl_ffdf(mtcars, src = "./db_ff", name = "cars")
Book Example: Bureau of Transportation Statistics
(https://www.transtats.bts.gov/homepage.asp)
• flights.ff <- read.table.ffdf(file = "flights_sep_oct15.txt", sep = ",", VERBOSE = TRUE, header = TRUE, next.rows = 100000, colClasses = NA)
• csv-read = 34.24 sec, ffdf-write = 6.303 sec, TOTAL = 40.543 sec; R object size 426.4 KB
• flights.table <- read.table("flights_sep_oct15.txt", sep = ",", header = TRUE) – done in 32 seconds, data size = 101.9 MB

• read.table.ffdf() on the 2013–2014 flights data (2 GB) – done with 28 files of 516.5 KB each (456 seconds, with only 380 MB RAM used)
• read.table() on the 2013–2014 flights data (2 GB) – done with a single 1.3 GB file (441 seconds, with a maximum of 4.85 GB RAM used)
Working with Big Data in R: Summary 1!
https://rviews.rstudio.com/2019/07/17/3-big-data-strategies-for-r/
Strategies:
• Strategy 1: Sample and Model (Sample or slice?)
• Strategy 2: Chunk and Pull (How? ff, ffbase, ffbase2?)
• Strategy 3: Push Compute to Data (DBI and dplyr database backends!)
• Webinar video: https://www.youtube.com/watch?v=ybKkdEuxxN8

Working with Big Data in R: Summary
from the Big Data Analytics with R book
• Unleashing the power of R from within:
• Use the apply() family of functions instead of loops, and pipes for data.frames (see the sketch below)
• Use R packages such as ff, ffbase, ffbase2 and bigmemory
• Apply statistical methods to large R objects through biglm, bigglm() and the ffbase package
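
A minimal sketch, purely illustrative, contrasting an explicit loop with the apply() family on a data frame:

# Loop version: column means of mtcars
means_loop <- numeric(ncol(mtcars))
for (j in seq_along(mtcars)) means_loop[j] <- mean(mtcars[[j]])

# apply()/sapply() versions: shorter and easier to read
means_apply  <- apply(mtcars, 2, mean)
means_sapply <- sapply(mtcars, mean)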
Working with Big Data in R: Summary
from the Big Data Analytics with R book
• Unleashing the power of R from within (continued):
• Enhance the speed of data processing with R libraries that support parallel computing, e.g. the parallel package (with the doParallel backend and the foreach package) and the boot package for parallel computing and bootstrapping in R (see the sketch below)
• Benefit from the faster data manipulation methods available in the data.table package
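
A minimal sketch, assuming the parallel, doParallel and foreach packages are installed, of parallelising a simple bootstrap:

library(parallel)
library(doParallel)
library(foreach)

cl <- makeCluster(2)          # start a small cluster of 2 workers
registerDoParallel(cl)

# Bootstrap the mean of mpg in parallel across the workers
boot_means <- foreach(i = 1:1000, .combine = c) %dopar% {
  mean(sample(mtcars$mpg, replace = TRUE))
}
stopCluster(cl)

quantile(boot_means, c(0.025, 0.975))   # simple percentile interval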
Data.table package:
https://okanbulut.github.io/bigdata/wrangling-big-data.html
• What is data.table?

• Why use data.table over tidyverse?

• Reading/writing data with data.table

• Using the i, j and by in data.table


Use “data.table” package in R:
• A frequent criticism of R is its inefficiency in handling large
datasets. That's where the R package data.table enters the
scene. If your datasets have more than tens of thousands of
rows, the data.table package is a must.
(https://psrc.github.io/r-data-table/)

• data.table is an alternative to R's default data.frame for handling tabular data. The reason it is so popular is the speed of execution on larger data and its terse syntax.
Use “data.table” package in R:
• So, effectively, you type less code and get much faster execution. It is one of the most downloaded packages in R and is preferred by data scientists.

• It is probably one of the best things that has happened to the R programming language as far as speed is concerned.

• Visit https://www.machinelearningplus.com/data-manipulation/datatable-in-r-complete-guide/ for a full tutorial on the data.table package and its syntax!
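
A minimal sketch of data.table's fread() reader and its DT[i, j, by] syntax mentioned above (the flights file name is just the earlier example and is assumed to exist):

library(data.table)

# fread() is data.table's fast file reader
# flights <- fread("flights_sep_oct15.txt")

dt <- as.data.table(mtcars)
# DT[i, j, by]: filter rows (i), compute (j), grouped by (by)
dt[cyl == 4, .(mean_mpg = mean(mpg), n = .N), by = gear]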
RevoScaleR package:
https://learn.microsoft.com/en-us/machine-learning-server/r-reference/revoscaler/revoscaler

• The RevoScaleR library is a collection of portable, scalable, and distributable R functions for importing, transforming, and analyzing data at scale.

• Functions run on the RevoScaleR interpreter, built on open-source R and engineered to leverage the multithreaded and multinode architecture of the host platform.

• You can use it for descriptive statistics, generalized linear models, k-means clustering, logistic regression, classification and regression trees, and decision forests.

RevoScaleR was open-sourced by Microsoft in June 2021: https://en.wikipedia.org/wiki/RevoScaleR
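
As an illustration only: RevoScaleR ships with Machine Learning Server / Microsoft R, so the sketch below will not run on a plain CRAN installation of R, and the flight file and column names are hypothetical placeholders. The function names follow the Microsoft documentation linked above:

library(RevoScaleR)

# Import a CSV into RevoScaleR's on-disk .xdf format (file name hypothetical)
flights_xdf <- rxImport(inData = "flights_sep_oct15.txt",
                        outFile = "flights.xdf", overwrite = TRUE)

rxSummary(~ DepDelay, data = flights_xdf)                 # descriptive statistics
fit <- rxLinMod(ArrDelay ~ DepDelay, data = flights_xdf)  # scalable linear model
summary(fit)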
RevoScaleR: After Microsoft from 2015!
https://learn.microsoft.com/en-us/machine-learning-server/r-reference/revoscaler/revoscaler

• The RevoScaleR library is found in Machine Learning Server and Microsoft R products.

• You can use any R IDE to write R scripts that call RevoScaleR functions, but the script must run on a computer that has the RevoScaleR interpreter and libraries.

• RevoScaleR is often preloaded into tools that integrate with Machine Learning Server and R Client.
Microsoft R Open with R Studio in Windows 10 PC: https://www.youtube.com/watch?v=fZXjPYd2Q4Y
Hadoop and MapReduce frameworks in R:
You will need to know Linux commands!
• RHadoop packages: they support MapReduce, HDFS and the HBase database directly from the R console (RStudio Server works even better!) [rmr2 and rhdfs of RHadoop are used for MapReduce and HDFS files!]

• The packages were developed by Revolution Analytics; after the acquisition of Revolution Analytics by Microsoft, the latter became the lead maintainer of the packages.

• All five R packages of RHadoop (rhdfs, rhbase, plyrmr, rmr2 and ravro), their binary files, documentation, and tutorials are available in the GitHub repository at https://github.com/RevolutionAnalytics/RHadoop/wiki
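
A minimal sketch in the style of the rmr2 tutorials, assuming rmr2 is installed and configured (its "local" backend option is assumed here so that no running Hadoop cluster is needed):

library(rmr2)
rmr.options(backend = "local")   # assumed local backend for testing without Hadoop

small_ints <- to.dfs(1:1000)     # write the input data to the (local/HDFS) store
result <- mapreduce(input = small_ints,
                    map = function(k, v) keyval(v, v^2))  # map each value to its square
head(values(from.dfs(result)))   # read the squared values back into R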
Revolution R Enterprise, R with sparkR and
sparklyr package for R Studio and Azure/SQL:
• Revolution/Microsoft R Enterprise adds proprietary components, e.g. ScaleR, to support statistical analysis of Big Data, and is sold as subscriptions for workstations, servers, Hadoop and databases.

• Single-user licenses are available free for academic users (I have one, with a VB GUI) as well as for users competing in Kaggle data mining competitions.

• More on its blog site: https://blog.revolutionanalytics.com/
Revolution R Enterprise, R with sparkR and
sparklyr package for R Studio and Azure/SQL:
• R in Azure SQL and SQL Server is also available, along with SparkR and sparklyr!
• https://spark.apache.org/docs/latest/sparkr.html
• https://spark.rstudio.com/ # Connection to H2O is also possible with sparklyr!
• https://cloudblogs.microsoft.com/sqlserver/2021/06/30/looking-to-the-future-for-r-in-azure-sql-and-sql-server/
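
A minimal sparklyr sketch, assuming a local Spark installation (e.g. via spark_install()):

library(sparklyr)
library(dplyr)

# spark_install()                        # one-time local Spark installation
sc <- spark_connect(master = "local")    # connect to a local Spark session

iris_tbl <- copy_to(sc, iris, overwrite = TRUE)   # copy a data frame into Spark
iris_tbl %>%
  group_by(Species) %>%
  summarise(mean_petal = mean(Petal_Length)) %>%  # Spark replaces "." in names with "_"
  collect()

spark_disconnect(sc)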
Big Data in R:
https://www.columbia.edu/~sjm2186/EPIC_R/EPIC_R_BigData.pdf

Strategies:
• Option I: Make the data smaller
• Option II: Get a bigger computer
• Option III: Use data.table rather than data.frame
• Option IV: Buffer the data set on disk
• Option V: Split it up
Questions/Queries?
Thank you!
@shitalbhandary
