TDWI - Accelerate
October 16, 2:30 – 3:15 PM EDT
Hyatt Regency, Bellevue
• Introduction to R
• Benefits and challenges
• R in Apache Spark: Distributed computing
• R in Databases: In-DB intelligence
Slideshare.net
• 3+M users
• Taught in most universities
• Thriving user groups worldwide
• 5th in 2016 IEEE Spectrum rank
• ~40% of pro analysts prefer R (highest among R, SAS, Python)
• 10,000+ contributed packages
• Many common use cases across industry
• Rich application & platform integration
What is R?
• The most popular statistical & ML programming language
• A data visualization tool
• Open source
Language
Platform
Community
Ecosystem
R Adoption is on a tear
76% of analytic professionals use R
36% select R as their primary tool
R Usage Growth
Rexer Data Miner Survey 2007-2015
2016 IEEE Spectrum rank
o In-Memory operation
o Lack of implicit parallelism
o Expensive data movement & duplication
Scaling R on Spark clusters
• What is Spark?
• A unified, open source, parallel data processing framework for Big Data Analytics
SparkR: R API included with Apache Spark
Data processing and modeling with SparkR
MLlib: Apache Spark's scalable machine learning library
sparklyr: R interface for Apache Spark
Source: http://spark.rstudio.com/
• Easy installation from CRAN
• Loads data into SparkDataFrame from: local R data frames, Hive tables, CSV, JSON, and Parquet files.
• Connect to both local instances of Spark and remote Spark clusters
dplyr and ML in sparklyr
• Includes three families of functions for machine learning pipelines:
• ml_*: Machine learning algorithms for analyzing data, provided by the spark.ml package.
• K-Means, GLM, LR, Survival Regression, DT, RF, GBT, PCA, Naive-Bayes, Multilayer Perceptron, LDA
• ft_*: Feature transformers for manipulating individual features.
• sdf_*: Functions for manipulating SparkDataFrames.
• Provides a complete dplyr backend for data manipulation and analysis
%>%
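The three function families and the dplyr backend can be combined in one short pipeline. The following is a minimal sketch rather than code from the talk: the function name run_sparklyr_sketch, the mtcars example data, the table name and the chosen columns are all assumptions made for illustration, and running it requires the sparklyr and dplyr packages plus a local Spark installation.

```r
# Sketch only: calling this function needs sparklyr, dplyr and a local Spark
# installation (for example one set up with sparklyr::spark_install()).
# The function name and the mtcars example are hypothetical.
run_sparklyr_sketch <- function(master = "local") {
  `%>%` <- dplyr::`%>%`                      # pipe operator re-exported by dplyr
  sc <- sparklyr::spark_connect(master = master)
  on.exit(sparklyr::spark_disconnect(sc))

  # Copy a local R data frame into Spark.
  cars_tbl <- dplyr::copy_to(sc, mtcars, "mtcars_spark", overwrite = TRUE)

  # dplyr verbs run inside Spark; an ft_* transformer adds a derived feature.
  prepared <- cars_tbl %>%
    dplyr::filter(hp > 80) %>%
    sparklyr::ft_binarizer(input_col = "hp", output_col = "high_hp",
                           threshold = 150)

  # An sdf_* function splits the Spark DataFrame; an ml_* function fits a model.
  splits <- sparklyr::sdf_random_split(prepared, training = 0.7, test = 0.3,
                                       seed = 42)
  model <- sparklyr::ml_linear_regression(splits$training, mpg ~ wt + cyl)

  # Score the held-out rows and collect the predictions back into R.
  sparklyr::ml_predict(model, splits$test) %>% dplyr::collect()
}
```

Defining the function has no side effects; calling run_sparklyr_sketch() starts a local Spark session and closes it again when the function returns.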
h2o: prediction engine in R
http://www.h2o.ai/product/
• Open source ML platform
• Optimized for “in memory” distributed, parallel ML
• Data manipulation and modeling on H2OFrame: R functions + h2o-prefixed functions.
• Transformations: h2o.group_by(), h2o.impute()
• Statistics: h2o.summary(), h2o.quantile(), h2o.mean()
• Algorithms: h2o.glm(), h2o.naiveBayes(), h2o.deeplearning(), h2o.kmeans(), ...
• rsparkling package: h2o on Spark
• Extension package for sparklyr that provides bindings to h2o’s machine learning algorithms
• Simple data conversion: SparkDataFrame -> H2OFrame
https://github.com/h2oai/rsparkling
ML Server 9.x: Scale-out R
• 100% compatible with open source R
• Virtually any code/package that works today with R will work in ML Server.
• Ability to parallelize any R function
• Ideal for parameter sweeps, simulation, scoring.
• Wide range of scalable and distributed rx-prefixed functions in the RevoScaleR package.
• Transformations: rxDataStep()
• Statistics: rxSummary(), rxQuantile(), rxChiSquaredTest(), rxCrossTabs()…
• Algorithms: rxLinMod(), rxLogit(), rxKmeans(), rxBTrees(), rxDForest()…
• Parallelism: rxSetComputeContext()
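To make the "parallelize any R function" point concrete, here is a small parameter sweep. It is a sketch under stated assumptions: it swaps in base R's parallel package instead of RevoScaleR so that it runs on plain open source R, and the function fit_degree, the mtcars data and the swept polynomial degrees are hypothetical choices for illustration.

```r
library(parallel)

# Hypothetical task: fit a polynomial of the given degree and return the
# in-sample RMSE. Any self-contained R function could take its place.
fit_degree <- function(degree) {
  model <- lm(mpg ~ poly(wt, degree), data = mtcars)
  sqrt(mean(residuals(model)^2))
}

# Sweep over the parameter values, one task per value. Forking is not
# available on Windows, so fall back to a single core there.
degrees <- 1:4
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
rmse <- unlist(mclapply(degrees, fit_degree, mc.cores = n_cores))
names(rmse) <- paste0("degree_", degrees)
```

Each task is independent of the others, which is what makes sweeps, simulations and batch scoring easy to distribute.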
Free Developer’s version available
https://aka.ms/freemrs
ScaleR library: parallel and portable for Big Data
Stream data into blocks from sources: Hive tables, CSV, Parquet, XDF, ODBC and SQL Server.
ScaleR algorithms work inside multiple cores / nodes in parallel at high speed.
Interim results are collected and combined analytically to produce the output on the entire data set.
XDF file format is optimised to work with the ScaleR library and significantly speeds up iterative algorithm processing.
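The block-wise idea can be illustrated in base R. This is a sketch of the general technique only, not ScaleR's implementation: the simulated vector, the block size of 1000 and the variable names are assumptions. Each block yields a few sufficient statistics, and those interim results are combined to give the mean and variance of the whole data set.

```r
# Simulated data, split into blocks as a stand-in for streaming a large file.
set.seed(42)
x <- rnorm(10000, mean = 5, sd = 2)
blocks <- split(x, ceiling(seq_along(x) / 1000))

# Per-block interim results: count, sum and sum of squares.
partials <- lapply(blocks, function(b) {
  c(n = length(b), s = sum(b), ss = sum(b^2))
})

# Combine the interim results analytically into whole-data statistics.
totals <- Reduce(`+`, partials)
chunked_mean <- totals[["s"]] / totals[["n"]]
chunked_var <- (totals[["ss"]] - totals[["s"]]^2 / totals[["n"]]) /
  (totals[["n"]] - 1)
```

Because only the three numbers per block need to be kept, the blocks can be processed one at a time or on different cores or nodes.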
Write once - deploy anywhere (WODA)
ScaleR: Portable across multiple platforms (local, Spark, SQL Server, etc.)
Models can be trained on one platform and deployed on another.

Compute context R script: sets where the model will run

In Spark:
### SETUP SPARK/HADOOP ENVIRONMENT VARIABLES ###
mySparkCC <- RxSpark()
### SPARK COMPUTE CONTEXT ###
rxSetComputeContext(mySparkCC)
### CREATE HDFS, DIRECTORY AND FILE OBJECTS ###
hdfsFS <- RxHdfsFileSystem()
AirlineDataSet <- RxXdfData("airline_20MM.xdf",
                            fileSystem = hdfsFS)

Local parallel processing (Linux or Windows):
### SETUP LOCAL ENVIRONMENT VARIABLES ###
myLocalCC <- "localpar"
### LOCAL COMPUTE CONTEXT ###
rxSetComputeContext(myLocalCC)
### CREATE LINUX, DIRECTORY AND FILE OBJECTS ###
linuxFS <- RxNativeFileSystem()
AirlineDataSet <- RxXdfData("airline_20MM.xdf",
                            fileSystem = linuxFS)

Functional model R script: does not need to change to run in Spark

### ANALYTICAL PROCESSING ###
### Statistical Summary of the data
rxSummary( ~ ArrDelay + DayOfWeek, data = AirlineDataSet, reportProgress = 1)
### Linear model and plot
hdfsXdfArrLateLinMod <- rxLinMod(ArrDelay ~ DayOfWeek + CRSDepTime, data = AirlineDataSet)
plot(hdfsXdfArrLateLinMod$coefficients)
Spark clusters in Azure HDInsight
• Provisions Azure compute resources with Spark 2.1 installed and configured.
• Supports multiple versions (e.g. Spark 1.6).
• Stores data in Azure Blob storage (WASB), Azure Data Lake Store or local HDFS.
ML Server Spark cluster architecture
Master R process on Edge Node
Apache YARN and Spark
Worker R processes on Data Nodes
ML Server
Data in Distributed Storage
Model deployment using ML Server operationalization services (mrsdeploy)
Easy Setup
• In-cloud or on-prem
• Adding nodes to scale
• High availability & load balancing
• Remote execution server
Easy Deployment
Data Scientist: Microsoft R Client (mrsdeploy package), publishService
Microsoft ML Server configured for operationalizing R analytics
Easy Integration
Developer
Easy Consumption
Data Scientist: Microsoft R Client (mrsdeploy package)
Typical advanced analytics lifecycle
Prepare/Explore, Model, Operationalize
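scoringFn <- function(newdata) {
  library(RevoScaleR)
  # Import the incoming data, then score it with `model`. The slide does not
  # define `model`; it is assumed to be a fitted model visible to this function.
  data <- rxImport(newdata)
  rxPredict(model, data)
}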
ML Server on Hadoop/HDInsight scales to hundreds of nodes, billions of rows and terabytes of data
[Figure: Logistic Regression on NYC Taxi Dataset (2.2 TB); elapsed time vs. billions of rows (0 to 13)]
Base and scalable approaches comparison
Approach   | Scalability                    | Spark | Hadoop | SQL Server | Teradata | Support
CRAN R (1) | Single machines                |       |        |            |          | Community
SparkR     | Single + Distributed computing | X     |        |            |          | Community
sparklyr   | Single + Distributed computing | X     |        |            |          | Community
h2o        | Single + Distributed computing | X     | X      |            |          | Community
RevoScaleR | Single + Distributed computing | X     | X      | X          | X        | Enterprise
(1) CRAN R indicates no additional R packages installed
tinyurl.com/Strata2017R
https://aka.ms/kdd2017r
https://github.com/Azure/Azure-MachineLearning-DataScience/tree/master/Misc/StrataSanJose2017
https://learnanalytics.microsoft.com/
https://github.com/Azure/Azure-MachineLearning-DataScience/tree/master/Misc/KDD2017MRS
https://docs.microsoft.com/en-us/machine-learning-server/what-is-machine-learning-server
https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-spark-overview
For Oracle In DB analytics, see: https://www.oracle.com/database/advanced-analytics/index.html
In-database machine learning
Develop: Develop, explore and experiment in your favorite IDE
Train: Train models with sp_execute_external_script and save the models in database
Deploy: Deploy your ML scripts with sp_execute_external_script and predict using the models
Consume: Make your app/reports intelligent by consuming predictions
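The train and deploy steps can be pictured from the R side. The sketch below runs in plain R and only imitates what an R script passed to sp_execute_external_script would see: InputDataSet and OutputDataSet are the default data frame names SQL Server uses for the script's input and output, while the mtcars stand-in data, the model formula and the variable names model_blob and restored are assumptions for illustration.

```r
# Stand-in for the data frame SQL Server hands to the script as InputDataSet.
InputDataSet <- mtcars[, c("mpg", "wt", "hp")]

# Train step: fit a model and serialize it to a raw vector, the form in which
# a model can be saved in a binary column of a database table.
model <- lm(mpg ~ wt + hp, data = InputDataSet)
model_blob <- serialize(model, connection = NULL)

# Deploy step: restore the stored model and score new rows. The result is
# returned to SQL Server through the data frame named OutputDataSet.
restored <- unserialize(model_blob)
OutputDataSet <- data.frame(
  predicted_mpg = predict(restored, newdata = InputDataSet)
)
```

Inside SQL Server the two steps would normally live in separate stored procedures, one that writes the serialized model to a table and one that reads it back for scoring.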
Eliminate data movement
Operationalize ML scripts and models
Enterprise grade performance and scale
SQL Transformations
Relational data
Analytics library
Free Developer’s versions available
https://aka.ms/sqlserverdeveloper
R services in-database: Data exploration and predictive modeling (Data Scientist)
EXEC TrainTipPredictionModel
https://docs.microsoft.com/en-us/sql/advanced-analytics/getting-started-with-machine-learning-services
https://blogs.msdn.microsoft.com/microsoft_press/2016/10/19/free-ebook-data-science-with-microsoft-sql-server-2016/
TDWI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics with R

TDWI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics with R

  • 1. TDWI - Accelerate October 16, 2:30 – 3:15 PM EDT Hyatt Regency, Bellevue
  • 2. • Introduction to R • Benefits and challenges • R in Apache Spark: Distributed computing • R in Databases: In-DB intelligence Slideshare.net
  • 3. What is R? • The most popular statistical & ML programming language • A data visualization tool • Open source • 3+M users • Taught in most universities • Thriving user groups worldwide • 5th in 2016 IEEE Spectrum rank • ~40% of pro analysts prefer R (highest among R, SAS, Python) • 10,000+ contributed packages • Many common use cases across industry • Rich application & platform integration (Slide labels: Language, Platform, Community, Ecosystem)
  • 4. R Adoption is on a tear 76% of analytic professionals use R 36% select R as their primary tool R Usage Growth Rexer Data Miner Survey 2007-2015 2016 IEEE Spectrum rank
  • 5. o In-Memory operation o Lack of implicit parallelism o Expensive data movement & duplication
  • 7. Scaling R on Spark clusters • What is Spark? • A unified, open source, parallel data processing framework for Big Data analytics
  • 8. SparkR: R API included with Apache Spark
  • 9. Data processing and modeling with SparkR. MLlib: Apache Spark's scalable machine learning library
  • 10. sparklyr: R interface for Apache Spark (Source: http://spark.rstudio.com/) • Easy installation from CRAN • Loads data into SparkDataFrame from: local R data frames, Hive tables, CSV, JSON, and Parquet files. • Connects to both local instances of Spark and remote Spark clusters
  • 11. dplyr and ML in sparklyr • Includes three families of functions for machine learning pipelines • ml_*: Machine learning algorithms for analyzing data, provided by the spark.ml package. • K-Means, GLM, LR, Survival Regression, DT, RF, GBT, PCA, Naive-Bayes, Multilayer Perceptron, LDA • ft_*: Feature transformers for manipulating individual features. • sdf_*: Functions for manipulating SparkDataFrames. • Provides a complete dplyr backend for data manipulation and analysis, including the %>% pipe
  • 12. h2o: prediction engine in R (http://www.h2o.ai/product/) • Open source ML platform • Optimized for “in memory” distributed, parallel ML • Data manipulation and modeling on H2OFrame: R functions + h2o-prefixed functions. • Transformations: h2o.group_by(), h2o.impute() • Statistics: h2o.summary(), h2o.quantile(), h2o.mean() • Algorithms: h2o.glm(), h2o.naiveBayes(), h2o.deeplearning(), h2o.kmeans(), ... • rsparkling package: h2o on Spark (https://github.com/h2oai/rsparkling) • Provides bindings to h2o’s machine learning algorithms: extension package for sparklyr • Simple data conversion: SparkDataFrame -> H2OFrame
  • 13. ML Server 9.x: Scale-out R • 100% compatible with open source R • Virtually any code/package that works today with R will work in ML Server. • Ability to parallelize any R function • Ideal for parameter sweeps, simulation, scoring. • Wide range of scalable and distributed rx-prefixed functions in the RevoScaleR package. • Transformations: rxDataStep() • Statistics: rxSummary(), rxQuantile(), rxChiSquaredTest(), rxCrossTabs()… • Algorithms: rxLinMod(), rxLogit(), rxKmeans(), rxBTrees(), rxDForest()… • Parallelism: rxSetComputeContext()
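
The parameter-sweep use case on slide 13 can be sketched with base R's `parallel` package. This is a stand-in, not ML Server or RevoScaleR code: the function `sweep_one`, the `lambdas` grid, and the simulated data are all hypothetical, and the sketch only shows why independent evaluations parallelize cleanly.

```r
library(parallel)

# Hypothetical task: for one shrinkage value, measure the squared error
# of a shrunken sample mean against the known true mean of 1.
sweep_one <- function(lambda) {
  set.seed(42)                      # same simulated data for every lambda
  x <- rnorm(1000, mean = 1)
  estimate <- mean(x) / (1 + lambda)
  c(lambda = lambda, sq_error = (estimate - 1)^2)
}

lambdas <- c(0, 0.1, 0.5, 1)

# Forking is unavailable on Windows, so fall back to one core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Each parameter value is evaluated independently of the others, so the
# calls can be handed to separate worker processes.
results <- mclapply(lambdas, sweep_one, mc.cores = n_cores)

# One row per parameter value.
sweep_table <- do.call(rbind, results)
```

Because no call depends on another's result, the same function could be dispatched to more workers without changing `sweep_one` itself.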
  • 14. Free Developer’s version available: https://aka.ms/freemrs
  • 15. ScaleR library: parallel and portable for Big Data • Streams data into blocks from sources: Hive tables, CSV, Parquet, XDF, ODBC and SQL Server. • ScaleR algorithms work inside multiple cores / nodes in parallel at high speed. • Interim results are collected and combined analytically to produce the output on the entire data set. • The XDF file format is optimised to work with the ScaleR library and significantly speeds up iterative algorithm processing.
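
The block-wise pattern on slide 15 can be sketched in base R. This is not RevoScaleR code: the helper names `chunk_stats` and `combine_stats`, the block size, and the simulated `arr_delay` values are hypothetical. The sketch only illustrates how per-block interim results (count, sum, sum of squares) can be combined into an exact mean and variance for the whole data set.

```r
# Illustrative only: a base R stand-in for block-wise processing.

# Interim result for one block: sufficient statistics for mean and variance.
chunk_stats <- function(x) {
  c(n = length(x), sum = sum(x), sum_sq = sum(x^2))
}

# Combine the interim results from all blocks into one summary.
combine_stats <- function(stats_list) {
  total <- Reduce(`+`, stats_list)
  n <- total[["n"]]
  mean_all <- total[["sum"]] / n
  var_all <- (total[["sum_sq"]] - n * mean_all^2) / (n - 1)
  c(n = n, mean = mean_all, var = var_all)
}

set.seed(1)
arr_delay <- rnorm(10000, mean = 8, sd = 30)   # made-up data

# Split into blocks of 1,000 rows, as if streamed from disk.
block_id <- ceiling(seq_along(arr_delay) / 1000)
blocks <- split(arr_delay, block_id)

# A distributed engine would process blocks on separate cores or nodes;
# here they are processed one after another.
interim <- lapply(blocks, chunk_stats)
summary_all <- combine_stats(interim)
```

Only the three numbers per block need to travel between workers, which is why this style of algorithm avoids moving the raw data.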
  • 16. Write once - deploy anywhere (WODA). ScaleR: Portable across multiple platforms – local, Spark, SQL-Server, etc. Models can be trained in one and deployed in another.

    Compute context R script (sets where the model will run), local parallel processing on Linux or Windows:

        ### SETUP LOCAL ENVIRONMENT VARIABLES ###
        myLocalCC <- "localpar"
        ### LOCAL COMPUTE CONTEXT ###
        rxSetComputeContext(myLocalCC)
        ### CREATE LINUX, DIRECTORY AND FILE OBJECTS ###
        linuxFS <- RxNativeFileSystem()
        AirlineDataSet <- RxXdfData("airline_20MM.xdf", fileSystem = linuxFS)

    Compute context R script, in Spark:

        ### SETUP SPARK/HADOOP ENVIRONMENT VARIABLES ###
        mySparkCC <- RxSpark()
        ### HADOOP COMPUTE CONTEXT ###
        rxSetComputeContext(mySparkCC)
        ### CREATE HDFS, DIRECTORY AND FILE OBJECTS ###
        hdfsFS <- RxHdfsFileSystem()
        AirlineDataSet <- RxXdfData("airline_20MM.xdf", fileSystem = hdfsFS)

    Functional model R script (does not need to change to run in Spark):

        ### ANALYTICAL PROCESSING ###
        ### Statistical Summary of the data
        rxSummary(~ ArrDelay + DayOfWeek, data = AirlineDataSet, reportProgress = 1)
        ### Linear model and plot
        hdfsXdfArrLateLinMod <- rxLinMod(ArrDelay ~ DayOfWeek + CRSDepTime, data = AirlineDataSet)
        plot(hdfsXdfArrLateLinMod$coefficients)
  • 17. Spark clusters in Azure HDInsight • Provisions Azure compute resources with Spark 2.1 installed and configured. • Supports multiple versions (e.g. Spark 1.6). • Stores data in Azure Blob storage (WASB), Azure Data Lake Store or Local HDFS.
  • 18. ML Server Spark cluster architecture • Master R process on the edge node • Apache YARN and Spark • Worker R processes on the data nodes • Data in distributed storage
  • 19. Model deployment using ML Server operationalization services (mrsdeploy) • Data Scientist: Microsoft R Client (mrsdeploy package), publishService • Developer: Easy Integration, Easy Consumption • Easy Deployment • Easy Setup: in-cloud or on-prem; adding nodes to scale; high availability & load balancing; remote execution server • Microsoft ML Server configured for operationalizing R analytics
  • 23.
        scoringFn <- function(newdata) {
          library(RevoScaleR)
          data <- rxImport(newdata)   # import the incoming data
          rxPredict(model, data)      # `model` must already exist in the calling environment
        }
  • 24. ML Server on Hadoop/HDInsight scales to hundreds of nodes, billions of rows and terabytes of data. [Chart: Logistic Regression on NYC Taxi Dataset (2.2 TB); axes: elapsed time, billions of rows (0 to 13)]
  • 25. Base and scalable approaches comparison

        Approach   | Scalability                    | Spark | Hadoop | SQL Server | Teradata | Support
        CRAN R (1) | Single machines                |       |        |            |          | Community
        SparkR     | Single + Distributed computing | X     |        |            |          | Community
        sparklyr   | Single + Distributed computing | X     |        |            |          | Community
        h2o        | Single + Distributed computing | X     | X      |            |          | Community
        RevoScaleR | Single + Distributed computing | X     | X      | X          | X        | Enterprise

    (1) CRAN R indicates no additional R packages installed.
    tinyurl.com/Strata2017R, https://aka.ms/kdd2017r
  • 30. For Oracle in-database analytics, see: https://www.oracle.com/database/advanced-analytics/index.html
  • 31. In-database machine learning: Develop, Train, Deploy, Consume • Develop: explore and experiment in your favorite IDE • Train: train models with sp_execute_external_script and save the models in the database • Deploy: deploy your ML scripts with sp_execute_external_script and predict using the models • Consume: make your app/reports intelligent by consuming predictions
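
The train, save, deploy, consume loop on slide 31 depends on a model surviving a round trip through storage. Below is a minimal base R sketch of that round trip. It assumes, which the slide does not state, that the model is stored as serialized bytes; the T-SQL wrapper around `sp_execute_external_script` is omitted, and the built-in `mtcars` data stands in for a database table.

```r
# Train: in the in-database workflow this step would run in the R runtime
# hosted by the database; here it runs locally on a built-in data set.
model <- lm(mpg ~ wt + hp, data = mtcars)

# Save: turn the model into a raw byte vector, the form that could be
# written to a binary column (assumption; the storage step is not shown).
model_bytes <- serialize(model, connection = NULL)

# Deploy: read the bytes back and rebuild the model object.
restored <- unserialize(model_bytes)

# Consume: score new rows with the restored model.
new_rows <- data.frame(wt = c(2.5, 3.2), hp = c(110, 150))
predictions <- predict(restored, newdata = new_rows)
```

The restored object behaves like the original, so training and scoring can happen in separate script invocations.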
  • 32. Eliminate data movement • Operationalize ML scripts and models • Enterprise grade performance and scale (Diagram labels: SQL Transformations, Relational data, Analytics library)
  • 33. Free Developer’s versions available: https://aka.ms/sqlserverdeveloper
  • 34. R services in-database: Data exploration and predictive modeling (Data Scientist)