session10
Working with large datasets: Chapter 5
• Unless you are looking for very rare events, you will get as good a
feel for the data from a few thousand data points as you would from
a few million.
• It is important that the subset you look at is a random sample.
• To sample, you need the data in a form that dplyr can manipulate. If the
data is too large even to load into R, you cannot put it in a data frame to
sample from in the first place.
• Luckily, dplyr supports data that is stored on disk rather than in RAM,
through several backend formats.
• install.packages("ffbase")
• library(ffbase)
• install.packages("ffbase2")
• library(ffbase2)
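The random-sampling advice above can be sketched in base R as follows. This is a minimal illustration, not a prescribed workflow: the data frame `big` and its columns are synthetic stand-ins, and the sample size of 5,000 is an arbitrary choice.

```r
# Synthetic stand-in for a large data set (hypothetical columns).
set.seed(42)  # fix the seed so the sample is reproducible
big <- data.frame(id = seq_len(1e6), value = rnorm(1e6))

# Draw 5,000 row indices uniformly at random, without replacement.
idx <- sample(nrow(big), size = 5000)
small <- big[idx, ]

# The sample's summary statistics should be close to the full data's.
summary(small$value)
summary(big$value)
```

Taking the first rows with head() is not a random sample: if the file is sorted in any way, the first rows are biased. With dplyr loaded, slice_sample(big, n = 5000) does the same job as the indexing above.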
ff package
https://bookdown.org/josephine_lukito/j381m_tutorials/ff.html
• “ff” is a package that helps you work with larger datasets.
• It does this with flat files (hence the “ff”): the data stays on
disk, and the R objects are vectors that point to that disk storage.
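As a rough sketch of the idea, the snippet below converts a small in-memory data frame to an ffdf object and reads a slice back. It assumes the ff package is installed (the code is skipped otherwise), and the data frame `df` is a made-up example; a real use case would start from a file too large for RAM.

```r
# Run only when the ff package is available.
if (requireNamespace("ff", quietly = TRUE)) {
  library(ff)

  # Small made-up data frame standing in for a large one.
  df <- data.frame(id = 1:1000, value = rnorm(1000))

  # Convert to an ffdf: the column data is now backed by files on disk.
  fd <- as.ffdf(df)
  print(dim(fd))

  # Indexing pulls only the requested rows back into RAM as a data frame.
  first_rows <- fd[1:5, ]
  print(first_rows)
}
```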
Strategies:
• Visit https://www.machinelearningplus.com/data-manipulation/datatable-in-r-complete-guide/
for a full tutorial on the data.table package and its syntax.
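For a first impression of the data.table syntax, here is a minimal sketch. It assumes the data.table package is installed (the code is skipped otherwise); the table `DT` and its columns are invented for illustration.

```r
# Run only when the data.table package is available.
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)
  set.seed(1)

  # Invented example table with a grouping column and a numeric column.
  DT <- data.table(
    group = sample(c("a", "b", "c"), 1e5, replace = TRUE),
    value = rnorm(1e5)
  )

  # General form DT[i, j, by]: filter rows (i), compute (j), per group (by).
  res <- DT[value > 0, .(n = .N, mean_value = mean(value)), by = group]
  print(res)

  # := adds a column by reference, without copying the whole table.
  DT[, value_sq := value^2]
}
```

Modifying by reference is one reason data.table suits large data: no second copy of the table is made. For reading large delimited files, the package also provides fread().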
RevoScaleR package:
https://learn.microsoft.com/en-us/machine-learning-server/r-reference/revoscaler/revoscaler
• All five RHadoop packages (rhdfs, rhbase, plyrmr, rmr2 and ravro), with
their binary files, documentation, and tutorials, are available from the
GitHub repository at https://github.com/RevolutionAnalytics/RHadoop/wiki
Revolution R Enterprise, R with SparkR, and the
sparklyr package for RStudio and Azure/SQL:
• Revolution/Microsoft R Enterprise adds proprietary components, such as
ScaleR, to support statistical analysis of Big Data. It is sold by
subscription for workstations, servers, Hadoop, and databases.
• https://spark.apache.org/docs/latest/sparkr.html
• https://cloudblogs.microsoft.com/sqlserver/2021/06/30/looking-to-the-future-for-r-in-azure-sql-and-sql-server/
Big Data in R:
https://www.columbia.edu/~sjm2186/EPIC_R/EPIC_R_BigData.pdf
Strategies: