SlideShare a Scribd company logo
Intro to big data analytics using microsoft machine learning server with spark
Disclaimer:
This is NOT an R , Python ,
Hadoop or Spark Overview
• From its inception, R was designed to use only a
single thread (processor) at a time. Even today, R
works that way unless linked with multi-threaded
BLAS/LAPACK libraries.
• Microsoft R Open : automatically use all available
cores and processors to significantly reduce
computation times — without the need to change a
line of your R code.
broad range of scalable and distributed R & Python functions
Platforms & Data
Tools
Languages
Algorithms
Data Sources
Rattle Mrsdeploy
RESTful API
deployment
Real-Time
Scoring
Visualization
Tool
Integration
.csv Microsoft .XDF
In-database
deployment
Operationalization
Distributed Parallelized Algorithms:
•RevoScaleR library
•MicrosoftML library
•Custom parallelization frameworks
Open source R algorithms
& visualizations:
•CRAN
•bioconductor
Plus:
•Deep Learning
•Pretrained Models
•Prebuilt Featurizers
ODBC/JDBC
• Small data many models
• Hybrid
• Large scale machine learning
• Batch training and scoring
• Model deployment for LOB applications
Intro to big data analytics using microsoft machine learning server with spark
Machine Learning
and Analytics
Big Data StoresInformation
Management
Big Data as a cornerstone for Azure Data Services
Transform data into intelligent actions and predictions
Action
People
Automated
Systems
Apps
Web
Mobile
Bots
Event Hubs
Data Catalog
Data Factory
HDInsight
(Hadoop and
Spark)
Stream Analytics
Intelligence
Data Lake
Analytics
Machine
Learning
SQL Data
Warehouse
Data Lake Store
Data
Sources
Apps
Sensors
and
devices
Data
Cosmos DB
Intelligence
Dashboards &
Visualizations
Cortana
Bot
Service
Cognitive
Services
Power BI
Azure Analysis
Services
SQL Database
Batch AI
Blob Store
Databricks
•
•
• Support:
• Inadequacy of Community Support
• Application Governance:
• Commonly prohibited in production systems without commercial vendor
•
•
•
•
•
•
•
•
•
•
• Support for 1.6, 2.0, 2.1 and 2.2
• Support for all three distributions of Hadoop on all three flavors of Linux
• Inherits high-performance big-data processing characteristics from Spark
• Easily integrate with different data sources e.g. HIVE, Parquet , Orc, XDF, csv, etc…
• E2E framework for running parallel workloads
• Interoperability with Sparklyr and H2O
• Best in-class operationalization
• Guaranteed backward compatibility
R User
Workstation
MLS Server for Hadoop
Data
Frames
YARN Resource
Management
Spark
Executor
Worker
Task
Spark
Executor
Worker
Task
Spark
Executor
Worker
Task
.csv, xdf, hive,
parquet, ORC
ScaleR
Master Task
Finalizer
Initiator
Edge Node
Spark
Driver
Ver. 1.6 or 2.0
RStudio Desktop/Server
Microsoft R Server
Data
Frames
Worker
Task
Worker
Task
Worker
Task
ScaleR
Master Task
Finalizer
Initiator
Remote Execution:
ssh
Web Services
MRSDeplo
y
R Tools (VSCode,
Vstudio,Rstudio, …)
BI Tools &
Applications
Jupyter Notebooks
Thin Client IDEs
https://
https://
Edge Node
RStudio
Rstudio Server
Built-in remote execute
functions in R Client/R Server
Tools to reconcile local and
remote
Execute .R script or
interactive R commands
Return results to local
Generate working snapshots
for resume and reuse
IDE agnostic
MLS Server
configured to
(Support Window Server, Linux
Server, Hadoop )
Remote Execute R Scripts
 Execute R Scripts
 Snapshot remote env.
 Logout remote server
 Login remote server
 Generate Diff report
 Reconcile Environment
DEMO :
Interacting with MLS Server
Rstudio Desktop (Remote Context),
Rstudio Server
### SETUP HADOOP ENVIRONMENT VARIABLES ###
myHadoopCluster <- RxSpark()
### HADOOP COMPUTE CONTEXT ###
rxSetComputeContext(myHadoopCluster)
### CREATE HDFS, DIRECTORY AND FILE OBJECTS ###
hdfsFS <- RxHdfsFileSystem()
### ANALYTICAL PROCESSING ###
### Statistical Summary of the data
rxSummary(~ArrDelay+DayOfWeek, data= AirlineDataSet, reportProgress=1)
### CrossTab the data
rxCrossTabs(ArrDelay ~ DayOfWeek, data= AirlineDataSet, means=T)
### Linear Model and plot
hdfsXdfArrLateLinMod <- rxLinMod(ArrDelay ~ DayOfWeek + 0 , data = AirlineDataSet)
plot(hdfsXdfArrLateLinMod$coefficients)
### LOCAL COMPUTE CONTEXT ###
rxSetComputeContext("local")
### CREATE, REFERENCE FILE OBJECTS ###
AirlineDataSet <-
RxTextData(file.path(dataRead,"AirlineDemoSmall.csv")
)
Local Parallel processing – Linux or Windows In – Hadoop (Spark / MR) or SqlServer
ScaleR models can be deployed from a server or edge node to run in Hadoop
without any functional R model re-coding for map-reduce
Functional model R
script – does not
need to change to
run in Hadoop
Compute
context R script
– sets where the
model will run
• Defines where the processing happens
• Current set compute context determines processing
location
• Write Once Deploy Anywhere (WODA) by changing
compute context
"/data/admin.csv"
"/data/admin.csv"
“admintable”
rxSetComputeContext(RxSpark(…))
mydata<-RxTextData("/data/admin.csv", fileSystem = RxHdfsFileSystem())
mylogIt <- rxLogit(admin~ gre + gpa + rank, data = mydata)
Using a Spark
compute context
Keep other code
unchanged
Rx…..Data
DEMO :
Switching Context
single_code_base.R
• Scale via parallel & distributed computation
• Scales via in-cluster & in-database execution
• Escapes R’s memory limitations
• Speed analysis by reducing data
movement
• Stop security risks with in-database & in-
Hadoop
• Deploys R to multiple platforms
f(x)
Task
Task
Task
RevoScaleR Algorithms & Functions
Task
Task
Task
Finalizer
Initiator
Data Larger
than RAM
Internal Flow:
1. Algorithm begins initiator process
2. Initiator distributes work to nodes
3. Finalizer collects results
4. Finalizer iterates or continues
5. Finalizer evaluates final model
6. Returns single model to calling script
… load a large dataset
Model_obj <- rxLinMod(…)
… run a RevoScaleR algorithm
Typical Characteristics
 One call, one answer
 Arbitrarily large data sets
 Arbitrarily large worker task set
 Mathematically the same as single-threaded
 Platform independent
 Most are written in C++ for speed
Remote RevoScaleR Algorithm
Internal Flow:
1. Algorithm on local checks compute context
2. If set remote, packages and ships request
3. Local script blocks (by default) awaiting response
4. Remote unpacks and executes in parallel
5. Remote returns results to local interface
6. Local interface returns results to script
… set compute context to Spark
rxSetComputeContext(RxSpark(…))
Model_obj <- rxLinMod(…)
… run a RevoScaleR algorithm
Big
data
Remote Execution:
 Executes RevoScaleR algos on remote data & CPUs
 “rxSetComputeContext” redirects to remote
 Algorithms in RevoScaleR library redirect as set
 Results are returned to script as though local
Local
RevoScaleR
Algorithm
Algorithm
Interface
Remote
Local
Task
Task
Task
Finalizer
Initiator
SQL Server, Hadoop, Hadoop MapReduce, HDInsight
RevoScaleR Remote Execution
Intro to big data analytics using microsoft machine learning server with spark
Intro to big data analytics using microsoft machine learning server with spark
RevoScaleR Spark + Sparklyr + H20
* Interoperability with Sparklyr and H2O
* Best in-class operationalization
* Guaranteed backward compatibility
DEMO :
Data manipulation using sparklyr,
Interoperability between sparklyr and RevoScaleR
IntelligentApps
• Directly integrate intelligence into apps.
• Streamlined real-time scoring:
• Web Services operationalization
Platforms and Data
Tools
Language
Algorithms
Data Sources
Operationalization
Mrsdeploy
RESTful API
deployment
Real-Time
Scoring
Visualization
Tool
Integration
In-database
deployment
https://www.r-bloggers.com/deploying-a-car-price-model-using-
r-and-azureml/
• Seamless integration
with authentication
solution:
LDAP/AD/AAD
• Secure connection:
HTTPS encrypted by
TLS 1.2/SSL
• Compliance with
Microsoft Security
Development
Lifecycle
R
Client
Load
Balancer
• Server level HA:
Introduce multiple
Web Nodes for
Active-Active backup
/ recovery, via load
balancer
• Data Store HA:
leverage Enterprise
grade DB, SQL Server
and Postgres’ HA
capabilities
• Easily create web services from R scripts & models
Build the model first Deploy as a web service instantly
# Run the following code in R
swagger <- api$swagger()
cat(swagger, file = "swagger.json",
append = FALSE)
Generate Swagger
Docs for Web Services
Popular Swagger Tools:
AutoRest or Code Generator
AutoRest.exe -CodeGenerator
CSharp -Modeler Swagger -
Input swagger.json -
Namespace Mynamespace
Run Swagger tools to
generate code
Write a few code to
consume the service
Data Scientist DeveloperDeveloper
DEMO :
Publish Webservice
Intro to big data analytics using microsoft machine learning server with spark
THANK YOU!
Choice • Open Source: Microsoft R Open:
• Microsoft innovations: Microsoft R Extensions:
Distributed Parallelized Algorithms:
•RevoScaleR library
•MicrosoftML library
•Custom parallelization frameworks
Open source R algorithms
& visualizations:
•CRAN
•bioconductor
Plus:
•Deep Learning
•Pretrained Models
•Prebuilt Featurizers
Platforms and Data
Tools
Language
Algorithms
Data Sources
Operationalization
Variable Selection
Stepwise Regression
Simulation
Simulation (e.g. Monte Carlo)
Parallel Random Number Generation
Cluster Analysis
K-Means
Classification
Decision Trees
Decision Forests
Gradient Boosted Decision Trees
Naïve Bayes
Custom Parallel
rxDataStep
rxExec
rxExecBy – Many-Models
PEMA-R API Custom Algorithms
Data Step
Data import – Delimited, Fixed, SAS, SPSS, ODBC
Variable creation & transformation
Recode variables
Factor variables
Missing value handling
Sort, Merge, Split
Aggregate by category (means, sums)
Descriptive Statistics
Min / Max, Mean, Median (approx.)
Quantiles (approx.)
Standard Deviation
Variance
Correlation
Covariance
Sum of Squares (cross product matrix for set variables)
Pairwise Cross tabs
Risk Ratio & Odds Ratio
Cross-Tabulation of Data (standard tables & long form)
Marginal Summaries of Cross Tabulations
Statistical Tests
Chi Square Test
Kendall Rank Correlation
Fisher’s Exact Test
Student’s t-Test
Sampling
Subsample (observations & variables)
Random Sampling
Predictive Models
Sum of Squares (cross product matrix for set variables)
Quantiles (approx.)
Generalized Linear Models (GLM) exponential family
distributions: binomial, Gaussian, inverse Gaussian, Poisson,
Tweedie. Standard link functions: cauchit, identity, log, logit,
probit. User defined distributions & link functions.
Covariance & Correlation Matrices
Logistic Regression
Linear regression
Predictions/scoring for models
Residuals for all models
Learner Strength Learning tasks supported Sample Applications
Deep Neural Network Supports custom, multi-layer network
topology with filtered, convolutional,
and pooling bundles
Binary classification
Multi-class classification
Regression
Bing Ads Click Prediction
($50M per year revenue
gain);
Image Classification
Logistic regression L1, L2 regularization Binary classification
Multi-class classification.
Classifying user feedback
One-class SVM Easy to train learner for anomaly
detection
Anomaly Detection Fraud detection
Fast Tree Boosted decision tree. Similar to
XGBoost. Supports upto 1B features!
Binary classification
Regression
One of the most popular and
best performing learners
inside Microsoft.
Fast Forest state-of-the-art tree ensembles
(Random Forest)
Binary classification
Regression
Churn Prediction
Fast Linear (SDCA) Speed, scalability and supports L1,L2
regularization
Binary classification,
Regression
Outlook used for email spam
filtering
Transform Strength Description Applications
Text Battle tested, large language
support, performant (Bing, Office)
Performs natural language
processing of free text into
numerical representation.
Support ticket
classification,
Sentiment analysis
Categorical /
Categorical hash
Ease of use; 1 line of code to set Converts categories into
numerical data.
Ad Click Prediction
Feature Selection Ease of use Selects a subset of features to
speed up training time.
Sentiment analysis,
Ad Click Prediction
Pretrained
Image
Featurization
 FeaturizeImage()
– Used to identify parts of
images
– People, things, animals,
etc.
 FeaturizeText()
– Returns Ngram digest
& counts from many
partitions of text data
Text
Featurization
Featurizer
Featurizer
Ngrams
(phrases)
counts
Text
Data
Sets Featurizer
ngram
ngram
ngram
Image
Data
Sets
Image
contents
Featurizer
Featurizer
Image
contents
found
Featurizer
 GetSentiment()
– Pretrained to return
sentiment score (0-1)
– English only for now
Pretrained
Sentiment
Analysis
Featurizer
Featurizer
getSentime
nt()
Text
Data
Sets Featurizer
Sentiment
Score
DistributedModeling
• “Many Models” on Partitioned Data
• Parallelized ensemble modeling
• New rxExecBy framework:
• Superior to Spark gapply:
Platforms and Data
Tools
Language
Algorithms
Data Sources
Operationalization
 rxEnsemble:
– Returns ensembled model
combining multiple types
– Ensembling settings
balance speed & accuracy
Many
Small
Models
Ensemble
Learning
Model 1
Model 2
rxEnsemble
Single or
Distributed
Data Sets
 ManyModels:
– Used to run model on
each of many partitions.
– Returns one model trained
per cohort (partition) of
data.
P3
Model P1
P2
P1
Model P2
Returns a
set of
Models
Data
Partitioned
by Cohort
Model P3
Model P1
Model P2
Model P3
Ensemble
Model
Scale • RevoScaleR
• Microsoft Machine Learning Library
• Create custom parallel algorithms and functions:
Distributed Parallelized Algorithms:
•RevoScaleR library
•MicrosoftML library
•Custom parallelization frameworks
Open source R algorithms
& visualizations:
•CRAN
•bioconductor
Plus:
•Deep Learning
•Pretrained Models
•Prebuilt Featurizers
Platforms and Data
Tools
Language
Algorithms
Data Sources
Operationalization
Intro to big data analytics using microsoft machine learning server with spark
SQLServer • Fast modeling platform with operationalization in one.
• Minimum data movement speeds & secures
• Direct T-SQL Integration
• Real-Time Scoring Support
• SQL Server security & administration
• Rapid deployment of R models
as stored procedures.
• Python support. Platforms and Data
Tools
Language
Algorithms
Data Sources
Operationalization
Distributed Parallelized Algorithms:
•RevoScaleR library
•MicrosoftML library
•Custom parallelization frameworks
Open source R algorithms
& visualizations:
•CRAN
•bioconductor
Plus:
•Deep Learning
•Pretrained Models
•Prebuilt Featurizers
.csv Microsoft .XDF ODBC/JDBC
Intro to big data analytics using microsoft machine learning server with spark
(SQL Server 2017)
• Python support
• Microsoft Machine Learning package included
• Process multiple related models in parallel with the rxExecBy function
• Create a shared script library with R script package management
• Native scoring with T-SQL PREDICT
• In-place upgrade of R components
Spark • Distribute RevoScaleR algorithms using Spark
• Spark 1.6 or 2.0 Compatible
• Load Spark Dataframes from:
• Interop with SparkETL, SparkSQL
• Support for HDInsight, Hortonworks, Cloudera
& MapR Hadoop Platforms and Data
Tools
Language
Algorithms
Data Sources
Operationalization
Distributed Parallelized Algorithms:
•RevoScaleR library
•MicrosoftML library
•Custom parallelization frameworks
Open source R algorithms
& visualizations:
•CRAN
•bioconductor
Plus:
•Deep Learning
•Pretrained Models
•Prebuilt Featurizers
.csv Microsoft .XDF ODBC/JDBC
Configuration:
• HDI cluster size: 100 nodes
- Edge node: D14 V2 (16 cores, 112GB)
- Worker Nodes: D12 (4 cores, 28GB)
• Dataset: NYC Taxi dataset, filtered,
transformed, and duplicated
• Number of columns: 19
• Format: CSV
• fs.azure.selfthrottling.read.factor=1
-200
300
800
1300
1800
0 1 2 3 4 5 6 7 8 9 10 11 12 13
ElapsedTime(seconds)
Billions of rows
rxLogit on a 100 node HDInsight cluster
2.2 TB
Intro to big data analytics using microsoft machine learning server with spark
IntelligentApps
• Directly integrate intelligence into apps.
• Streamlined real-time scoring:
• Web Services operationalization
Platforms and Data
Tools
Language
Algorithms
Data Sources
Operationalization
Mrsdeploy
RESTful API
deployment
Real-Time
Scoring
Visualization
Tool
Integration
In-database
deployment
Built-in remote execute
functions in R Client/R Server
Tools to reconcile local and
remote
Execute .R script or
interactive R commands
Return results to local
Generate working snapshots
for resume and reuse
IDE agnostic
R Server
configured to
(Support Window Server, Linux
Server, Hadoop )
Remote Execute R Scripts
 Execute R Scripts
 Snapshot remote env.
 Logout remote server
 Login remote server
 Generate Diff report
 Reconcile Environment
Agility • Visual Studio
• Support for open source R Studio GUI
• Support for JupyterR
• Microsoft contributing to Rattle
Rattle
Platforms and Data
Tools
Language
Algorithms
Data Sources
Operationalization
Intro to big data analytics using microsoft machine learning server with spark
Intro to big data analytics using microsoft machine learning server with spark
recognized leader
broadest R capabilities
performance and scale
architectural flexibility
Rapid deployment
Harnesses the hybrid cloud
Scales R for big data workloads
Blends best of open source and Microsoft technologies
platform choice
Intro to big data analytics using microsoft machine learning server with spark
Machine
Learning
Services
In-Database analytics with SQL Server
In SQL Server 2016, Microsoft launched two server platforms for integrating
the popular open source R language with business applications:
• SQL Server R Services (In-Database), for integration with SQL Server
• Machine Learning Server, for enterprise-level R deployments on Windows and Linux servers
In SQL Server 2017, the name has been changed to reflect support for the
popular Python language:
• SQL Server Machine Learning Services (In-Database) supports both R and Python for in-
database analytics
• Microsoft Machine Learning Server supports R and Python deployments on Windows
servers—expansion to other supported platforms is planned for late 2017
Capability
• Extensible in-database analytics, integrated with R,
exposed through T-SQL
• Centralized enterprise library for analytic models
Benefits
SQL Server
Analytical engines
Integrate with R/Python
Data management layer
Relational data
Use T-SQL interface
Stream data in-memory
Analytics library
Share and collaborate
Manage and deploy
R
Data scientists
Business
analysts
Publish algorithms, interact
directly with data
Analyze through T-SQL,
tools, and vetted algorithms
DBAs
Manage storage and
analytics together
Machine Learning Services
SQL Server
2017 setup
Install Machine
Learning Services
(In-Database)
Consent to install
Microsoft R
Open/Python
Optional: Install R
packages
on SQL Server
2017 machine
Database
configuration
Enable R
language extension
in database
Configure path
for RRO runtime
in database
Grant EXECUTE
EXTERNAL SCRIPT
permission to users
CREATE EXTERNAL EXTENSION [R]
USING SYSTEM LAUNCHER
WITH (RUNTIME_PATH =
'c:revolutionbin‘)
GRANT EXECUTE SCRIPT ON
EXTERNAL EXTENSION::R TO
DataScientistsRole;
/* User-defined role / users */
ML runtime
usage
Resource governance
via resource pool
Monitoring via DMVs
Troubleshooting via
XEvents/ DMVs
CREATE RESOURCE POOL ML_runtimes FOR
EXTERNAL EXTENSION
WITH MAX_CPU_PERCENT = 20,
MAX_MEMORY_PERCENT = 10;
select * from
sys.dm_resource_governor_resouce_pools
where name = ‘ML_runtimes';
• Original R script:
• IrisPredict <- function(data, model){
• library(e1071)
• predicted_species <- predict(model, data)
• return(predicted_species)
• }
•
•
• library(RODBC)
• conn <- odbcConnect("MySqlAzure", uid = myUser, pwd =
myPassword);
• Iris_data <-sqlFetch(conn, "Iris_Data");
• Iris_model <-sqlQuery(conn, "select model from
my_iris_model");
• IrisPredict (Iris_data, model);
• Calling R script from SQL Server:
• /* Input table schema */
• create table Iris_Data (name varchar(100), length int, width
int);
• /* Model table schema */
• create table my_iris_model (model varbinary(max));
•
• declare @iris_model varbinary(max) = (select model from
my_iris_model);
• exec sp_execute_external_script
• @language = 'R'
• , @script = '
• IrisPredict <- function(data, model){
• library(e1071)
• predicted_species <- predict(model, data)
• return(predicted_species)
• }
• IrisPredict(input_data_1, model);
• '
• , @parallel = default
• , @input_data_1 = N'select * from Iris_Data'
• , @params = N'@model varbinary(max)'
• , @model = @iris_model
• with result sets ((name varchar(100), length int, width int
• , species varchar(30)));
• Values highlighted in yellow are SQL queries embedded in the original R script
• Values highlighted in aqua are R variables that bind to SQL variables by name
launchpad.exe
sp_execute_external_script
sqlservr.exe Named pipe
SQLOS
XEvent
MSSQLSERVER Service MSSQLLAUNCHPAD Service
(one per SQL Server instance)
What and how
to launch
“launcher”
Bxlserver.exe
sqlsatellite.dll
Bxlserver.exe
sqlsatellite.dll
Windows Satellite
Process
sqlsatellite.dll
Run query
• More efficient than standalone clients
• Data does not all have to fit in memory
• Reduced data transmission over the network
• Most R Open (and Python R) functions are single threaded
• We can stream data in parallel and batches from SQL Server to/from script
• Use the power of SQL Server and ML to develop, train, and operationalize
• SQL Server compute context (remote compute context)
• T-SQL queries
• Memory-optimized tables
• Columnstore indexes
• Data compression
• Parallel query execution
• Stored procedures
Reduced surface area
and isolation
“external scripts enabled”
is required
Script execution outside of
SQL Server process space
Script execution
requires explicit
permission
sp_execute_external_script requires
EXECUTE ANY EXTERNAL SCRIPT
for non-admins
SQL Server login/user required
and db/table access
Satellite processes have
limited privileges
Satellite processes run under low
privileged, local user accounts
in the SQLRUserGroup
Each execution is isolated
— different users with
different accounts
Windows firewall rules
block outbound traffic
MicrosoftML is a package for Machine Learning Server, Microsoft R Client, and
SQL Server Machine Learning Services that adds state-of-the-art data
transforms, machine learning algorithms, and pretrained models to Microsoft R
functionality.
• Data transforms helps you to compose, in a pipeline, a custom set of transforms that are
applied to your data before training or testing. The primary purpose of these transforms is
to allow you to featurize your data.
• Machine learning algorithms enable you to tackle common machine learning tasks such as
classification, regression and anomaly detection. You run these high-performance functions
locally on Windows or Linux machines or on Azure HDInsight (Hadoop/Spark) clusters.
• Pretrained models for sentiment analysis and image featurization can also be installed and
deployed with the MicrosoftML package.
Identify Server
Set Source
Set Context
Use
• Run R or Python inside SQL Server 2017 with ML Services
• Existing SQL 2016 clients can R using SQL R Services
SQL
Server
2017 Run R engine
from within
the Query
Processor
SQL
Server
2016/17 Run R From
within the
Query
Processor
Move
BIG
Work to
the Data
T-SQL
Apps
T-SQL
Script
T-SQL
Stored
ProcedureSQL
Server
2016/17
Enable smart
non-R apps
BI & Reporting;
Web apps
T-SQL
Script
R
Engine
R
script
Large Data Sets in Chunks
Remote
Execution
Context
Results
Parallel
Worker
Tasks
Parallel
Algorithm
Iterate/
Sequence
SQL
ApplicationsT-SQL + SQL
Server
2016/17
Events
T-SQL
Production
Apps
Models
SQL
Server
2017
Real time
scoring
engine
Stored Proc’s
and Triggers
Events
Intro to big data analytics using microsoft machine learning server with spark
Use familiar T-SQL
stored procedures to
invoke R scripts from
your application
• Turn R analytics  Web
services in one line of
code;
• Swagger-based REST
APIs, easy to consume,
with any programming
languages, including R!
• Deploying web service
server to any platform:
Windows, SQL,
Linux/Hadoop
• On-prem or in cloud
• Fast scoring, real time
and batch
• Scaling to a grid for
powerful computing with
load balancing
• Diagnostic and capacity
evaluation tools
• Enterprise
authentication:
AD/LDAP or AAD
• Secure connection:
HTTPS with SSL/TLS 1.2
• Enterprise grade high
availability
Instant Deployment Deploy to Anywhere Fast and Scalable Secure and Reliable
Data Scientist
Developer
Easy Integration
Easy Deployment
Easy Setup
 In-cloud or on-prem
 Adding nodes to scale
 High availability & load balancing
 Remote execution server
Machine Learning
Server
configured for
operationalizing R analytics
Microsoft R Client
(mrsdeploy package)
Easy Consumption
publishServiceMicrosoft R Client
(mrsdeploy package)
Data Scientist
• Seamless integration
with authentication
solution:
LDAP/AD/AAD
• Secure connection:
HTTPS encrypted by
TLS 1.2/SSL
• Compliance with
Microsoft Security
Development
Lifecycle
R
Client
Load
Balancer
• Server level HA:
Introduce multiple
Web Nodes for
Active-Active backup
/ recovery, via load
balancer
• Data Store HA:
leverage Enterprise
grade DB, SQL Server
and Postgres’ HA
capabilities
• Easily create web services from R scripts & models
Build the model first Deploy as a web service instantly
Function Description
publishService
Publish a predictive function as a Web
Service
deleteService Delete a Web Service
getService Get a Web Service
ListServices List the different published web services
serviceOption
Retrieve, set, and list the different service
options
updateService Updates a Web Service
# Run the following code in R
swagger <- api$swagger()
cat(swagger, file = "swagger.json",
append = FALSE)
Generate Swagger
Docs for Web Services
Popular Swagger Tools:
AutoRest or Code Generator
AutoRest.exe -CodeGenerator
CSharp -Modeler Swagger -
Input swagger.json -
Namespace Mynamespace
Run Swagger tools to
generate code
Write a few code to
consume the service
Data Scientist DeveloperDeveloper
Intro to big data analytics using microsoft machine learning server with spark
• Easily scale up a single
server to a grid to
handle more
concurrent requests
• Load balancing cross
compute nodes
• Scale overall
performance with
shared pools of
warmed up R shells.
R
Client
Snapshot Functions
createSnapshot
Create a snapshot of the remote session (workspace and
working directory)
loadSnapshot
Load a snapshot from the server into the remote session
(workspace and working directory)
listSnapshots Get a list of snapshots for the current user
downloadSnapshot Download a snapshot from the server
deleteSnapshot Delete a snapshot from the server
Remote Objects Management
listRemoteFiles
Get a list of files in the working directory of the remote
session
deleteRemoteFile
Delete a file from the working directory of the remote
R session
getRemoteFile
Copy a file from the working directory of the remote R
session
putLocalFile
Copy a file from the local machine to the working
directory of the remote R session
getRemoteObject Get an object from the remote R session
putLocalObject
Put an object from the local R session and load it into
the remote R session
getRemoteWorkspace
Take all objects from the remote R session and load
them into the local R session
putLocalWorkspace
Take all objects from the local R session and load them
into the remote R session
Remote Connection
remoteLogin Remote login to the R Server with AD or admin credentials
remoteLoginAAD Remote login to R Server server using Azure AD
remoteLogout Logout of the remote session on the DeployR Server.
Remote Execution
remoteExecute Remote execution of either R code or an R script
remoteScript Wrapper function for remote script execution
diffLocalRemote Generate a 'diff' report between local and remote
pause Pause remote connection and back to local
resume Return the user to the 'REMOTE >' command prompt
Intro to big data analytics using microsoft machine learning server with spark
Ad

More Related Content

What's hot (20)

LEGO: Data Driven Growth Hacking Powered by Big Data
LEGO: Data Driven Growth Hacking Powered by Big Data LEGO: Data Driven Growth Hacking Powered by Big Data
LEGO: Data Driven Growth Hacking Powered by Big Data
DataWorks Summit/Hadoop Summit
 
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on HiveFaster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
DataWorks Summit/Hadoop Summit
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinIntro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Alex Zeltov
 
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
DataWorks Summit
 
Debunking Common Myths in Stream Processing
Debunking Common Myths in Stream ProcessingDebunking Common Myths in Stream Processing
Debunking Common Myths in Stream Processing
DataWorks Summit/Hadoop Summit
 
Real time fraud detection at 1+M scale on hadoop stack
Real time fraud detection at 1+M scale on hadoop stackReal time fraud detection at 1+M scale on hadoop stack
Real time fraud detection at 1+M scale on hadoop stack
DataWorks Summit/Hadoop Summit
 
Transitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkTransitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to Spark
Slim Baltagi
 
Accelerating TensorFlow with RDMA for high-performance deep learning
Accelerating TensorFlow with RDMA for high-performance deep learningAccelerating TensorFlow with RDMA for high-performance deep learning
Accelerating TensorFlow with RDMA for high-performance deep learning
DataWorks Summit
 
Unified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache BeamUnified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache Beam
DataWorks Summit/Hadoop Summit
 
Hadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in ProductionHadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in Production
DataWorks Summit/Hadoop Summit
 
Bringing complex event processing to Spark streaming
Bringing complex event processing to Spark streamingBringing complex event processing to Spark streaming
Bringing complex event processing to Spark streaming
DataWorks Summit
 
Semi-Supervised Learning In An Adversarial Environment
Semi-Supervised Learning In An Adversarial EnvironmentSemi-Supervised Learning In An Adversarial Environment
Semi-Supervised Learning In An Adversarial Environment
DataWorks Summit
 
Spark Uber Development Kit
Spark Uber Development KitSpark Uber Development Kit
Spark Uber Development Kit
DataWorks Summit/Hadoop Summit
 
Securing Hadoop in an Enterprise Context
Securing Hadoop in an Enterprise ContextSecuring Hadoop in an Enterprise Context
Securing Hadoop in an Enterprise Context
DataWorks Summit/Hadoop Summit
 
Embeddable data transformation for real time streams
Embeddable data transformation for real time streamsEmbeddable data transformation for real time streams
Embeddable data transformation for real time streams
Joey Echeverria
 
Applied Deep Learning with Spark and Deeplearning4j
Applied Deep Learning with Spark and Deeplearning4jApplied Deep Learning with Spark and Deeplearning4j
Applied Deep Learning with Spark and Deeplearning4j
DataWorks Summit
 
Spark mhug2
Spark mhug2Spark mhug2
Spark mhug2
Joseph Niemiec
 
Realizing the promise of portable data processing with Apache Beam
Realizing the promise of portable data processing with Apache BeamRealizing the promise of portable data processing with Apache Beam
Realizing the promise of portable data processing with Apache Beam
DataWorks Summit
 
Analysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data AnalyticsAnalysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data Analytics
DataWorks Summit/Hadoop Summit
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
DataWorks Summit/Hadoop Summit
 
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on HiveFaster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
DataWorks Summit/Hadoop Summit
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinIntro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Alex Zeltov
 
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
DataWorks Summit
 
Real time fraud detection at 1+M scale on hadoop stack
Real time fraud detection at 1+M scale on hadoop stackReal time fraud detection at 1+M scale on hadoop stack
Real time fraud detection at 1+M scale on hadoop stack
DataWorks Summit/Hadoop Summit
 
Transitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkTransitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to Spark
Slim Baltagi
 
Accelerating TensorFlow with RDMA for high-performance deep learning
Accelerating TensorFlow with RDMA for high-performance deep learningAccelerating TensorFlow with RDMA for high-performance deep learning
Accelerating TensorFlow with RDMA for high-performance deep learning
DataWorks Summit
 
Unified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache BeamUnified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache Beam
DataWorks Summit/Hadoop Summit
 
Hadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in ProductionHadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in Production
DataWorks Summit/Hadoop Summit
 
Bringing complex event processing to Spark streaming
Bringing complex event processing to Spark streamingBringing complex event processing to Spark streaming
Bringing complex event processing to Spark streaming
DataWorks Summit
 
Semi-Supervised Learning In An Adversarial Environment
Semi-Supervised Learning In An Adversarial EnvironmentSemi-Supervised Learning In An Adversarial Environment
Semi-Supervised Learning In An Adversarial Environment
DataWorks Summit
 
Embeddable data transformation for real time streams
Embeddable data transformation for real time streamsEmbeddable data transformation for real time streams
Embeddable data transformation for real time streams
Joey Echeverria
 
Applied Deep Learning with Spark and Deeplearning4j
Applied Deep Learning with Spark and Deeplearning4jApplied Deep Learning with Spark and Deeplearning4j
Applied Deep Learning with Spark and Deeplearning4j
DataWorks Summit
 
Realizing the promise of portable data processing with Apache Beam
Realizing the promise of portable data processing with Apache BeamRealizing the promise of portable data processing with Apache Beam
Realizing the promise of portable data processing with Apache Beam
DataWorks Summit
 

Similar to Intro to big data analytics using microsoft machine learning server with spark (20)

Machine learning services with SQL Server 2017
Machine learning services with SQL Server 2017Machine learning services with SQL Server 2017
Machine learning services with SQL Server 2017
Mark Tabladillo
 
Introduction to Microsoft R (Graph)
Introduction to Microsoft R (Graph)Introduction to Microsoft R (Graph)
Introduction to Microsoft R (Graph)
Cheah Eng Soon
 
Introduction to Microsoft R
Introduction to Microsoft RIntroduction to Microsoft R
Introduction to Microsoft R
Cheah Eng Soon
 
Michal Marušan: Scalable R
Michal Marušan: Scalable RMichal Marušan: Scalable R
Michal Marušan: Scalable R
GapData Institute
 
RUCK 2017 R에 날개 달기 - Microsoft R과 클라우드 머신러닝 소개
RUCK 2017 R에 날개 달기 - Microsoft R과 클라우드 머신러닝 소개RUCK 2017 R에 날개 달기 - Microsoft R과 클라우드 머신러닝 소개
RUCK 2017 R에 날개 달기 - Microsoft R과 클라우드 머신러닝 소개
r-kor
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
Debraj GuhaThakurta
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
Debraj GuhaThakurta
 
Apache Spark on HDinsight Training
Apache Spark on HDinsight TrainingApache Spark on HDinsight Training
Apache Spark on HDinsight Training
Synergetics Learning and Cloud Consulting
 
DataMass Summit - Machine Learning for Big Data in SQL Server
DataMass Summit - Machine Learning for Big Data  in SQL ServerDataMass Summit - Machine Learning for Big Data  in SQL Server
DataMass Summit - Machine Learning for Big Data in SQL Server
Łukasz Grala
 
Advanced analytics with R and SQL
Advanced analytics with R and SQLAdvanced analytics with R and SQL
Advanced analytics with R and SQL
MSDEVMTL
 
SQL Server R Services: What Every SQL Professional Should Know
SQL Server R Services: What Every SQL Professional Should KnowSQL Server R Services: What Every SQL Professional Should Know
SQL Server R Services: What Every SQL Professional Should Know
Bob Ward
 
Microsoft R Server for Data Sciencea
Microsoft R Server for Data ScienceaMicrosoft R Server for Data Sciencea
Microsoft R Server for Data Sciencea
Data Science Thailand
 
Revolution R Enterprise - Portland R User Group, November 2013
Revolution R Enterprise - Portland R User Group, November 2013Revolution R Enterprise - Portland R User Group, November 2013
Revolution R Enterprise - Portland R User Group, November 2013
Revolution Analytics
 
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Cloudera, Inc.
 
Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark Streaming
Hari Shreedharan
 
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
Jürgen Ambrosi
 
Experience sql server on l inux and docker
Experience sql server on l inux and dockerExperience sql server on l inux and docker
Experience sql server on l inux and docker
Bob Ward
 
Analysing big data with cluster service and R
Analysing big data with cluster service and RAnalysing big data with cluster service and R
Analysing big data with cluster service and R
Lushi Chen
 
Building a Scalable Data Science Platform with R
Building a Scalable Data Science Platform with RBuilding a Scalable Data Science Platform with R
Building a Scalable Data Science Platform with R
DataWorks Summit/Hadoop Summit
 
High Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2OHigh Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2O
Sri Ambati
 
Machine learning services with SQL Server 2017
Machine learning services with SQL Server 2017Machine learning services with SQL Server 2017
Machine learning services with SQL Server 2017
Mark Tabladillo
 
Introduction to Microsoft R (Graph)
Introduction to Microsoft R (Graph)Introduction to Microsoft R (Graph)
Introduction to Microsoft R (Graph)
Cheah Eng Soon
 
Introduction to Microsoft R
Introduction to Microsoft RIntroduction to Microsoft R
Introduction to Microsoft R
Cheah Eng Soon
 
RUCK 2017 R에 날개 달기 - Microsoft R과 클라우드 머신러닝 소개
RUCK 2017 R에 날개 달기 - Microsoft R과 클라우드 머신러닝 소개RUCK 2017 R에 날개 달기 - Microsoft R과 클라우드 머신러닝 소개
RUCK 2017 R에 날개 달기 - Microsoft R과 클라우드 머신러닝 소개
r-kor
 
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
TDWI Accelerate, Seattle, Oct 16, 2017: Distributed and In-Database Analytics...
Debraj GuhaThakurta
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...
Debraj GuhaThakurta
 
DataMass Summit - Machine Learning for Big Data in SQL Server
DataMass Summit - Machine Learning for Big Data  in SQL ServerDataMass Summit - Machine Learning for Big Data  in SQL Server
DataMass Summit - Machine Learning for Big Data in SQL Server
Łukasz Grala
 
Advanced analytics with R and SQL
Advanced analytics with R and SQLAdvanced analytics with R and SQL
Advanced analytics with R and SQL
MSDEVMTL
 
SQL Server R Services: What Every SQL Professional Should Know
SQL Server R Services: What Every SQL Professional Should KnowSQL Server R Services: What Every SQL Professional Should Know
SQL Server R Services: What Every SQL Professional Should Know
Bob Ward
 
Microsoft R Server for Data Sciencea
Microsoft R Server for Data ScienceaMicrosoft R Server for Data Sciencea
Microsoft R Server for Data Sciencea
Data Science Thailand
 
Revolution R Enterprise - Portland R User Group, November 2013
Revolution R Enterprise - Portland R User Group, November 2013Revolution R Enterprise - Portland R User Group, November 2013
Revolution R Enterprise - Portland R User Group, November 2013
Revolution Analytics
 
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Cloudera, Inc.
 
Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark Streaming
Hari Shreedharan
 
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
6° Sessione - Ambiti applicativi nella ricerca di tecnologie statistiche avan...
Jürgen Ambrosi
 
Experience sql server on l inux and docker
Experience sql server on l inux and dockerExperience sql server on l inux and docker
Experience sql server on l inux and docker
Bob Ward
 
Analysing big data with cluster service and R
Analysing big data with cluster service and RAnalysing big data with cluster service and R
Analysing big data with cluster service and R
Lushi Chen
 
High Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2OHigh Performance Machine Learning in R with H2O
High Performance Machine Learning in R with H2O
Sri Ambati
 
Ad

Recently uploaded (20)

Machine learning project on employee attrition detection using (2).pptx
Machine learning project on employee attrition detection using (2).pptxMachine learning project on employee attrition detection using (2).pptx
Machine learning project on employee attrition detection using (2).pptx
rajeswari89780
 
ELectronics Boards & Product Testing_Shiju.pdf
ELectronics Boards & Product Testing_Shiju.pdfELectronics Boards & Product Testing_Shiju.pdf
ELectronics Boards & Product Testing_Shiju.pdf
Shiju Jacob
 
Avnet Silica's PCIM 2025 Highlights Flyer
Avnet Silica's PCIM 2025 Highlights FlyerAvnet Silica's PCIM 2025 Highlights Flyer
Avnet Silica's PCIM 2025 Highlights Flyer
WillDavies22
 
Basic Principles for Electronics Students
Basic Principles for Electronics StudentsBasic Principles for Electronics Students
Basic Principles for Electronics Students
cbdbizdev04
 
Smart_Storage_Systems_Production_Engineering.pptx
Smart_Storage_Systems_Production_Engineering.pptxSmart_Storage_Systems_Production_Engineering.pptx
Smart_Storage_Systems_Production_Engineering.pptx
rushikeshnavghare94
 
MAQUINARIA MINAS CEMA 6th Edition (1).pdf
MAQUINARIA MINAS CEMA 6th Edition (1).pdfMAQUINARIA MINAS CEMA 6th Edition (1).pdf
MAQUINARIA MINAS CEMA 6th Edition (1).pdf
ssuser562df4
 
How to Make Material Space Qu___ (1).pptx
How to Make Material Space Qu___ (1).pptxHow to Make Material Space Qu___ (1).pptx
How to Make Material Space Qu___ (1).pptx
engaash9
 
Mathematical foundation machine learning.pdf
Mathematical foundation machine learning.pdfMathematical foundation machine learning.pdf
Mathematical foundation machine learning.pdf
TalhaShahid49
 
Reagent dosing (Bredel) presentation.pptx
Reagent dosing (Bredel) presentation.pptxReagent dosing (Bredel) presentation.pptx
Reagent dosing (Bredel) presentation.pptx
AlejandroOdio
 
vlsi digital circuits full power point presentation
vlsi digital circuits full power point presentationvlsi digital circuits full power point presentation
vlsi digital circuits full power point presentation
DrSunitaPatilUgaleKK
 
Dust Suppressants: A Sustainable Approach to Dust Pollution Control
Dust Suppressants: A Sustainable Approach to Dust Pollution ControlDust Suppressants: A Sustainable Approach to Dust Pollution Control
Dust Suppressants: A Sustainable Approach to Dust Pollution Control
Janapriya Roy
 
Artificial Intelligence (AI) basics.pptx
Artificial Intelligence (AI) basics.pptxArtificial Intelligence (AI) basics.pptx
Artificial Intelligence (AI) basics.pptx
aditichinar
 
Value Stream Mapping Worskshops for Intelligent Continuous Security
Value Stream Mapping Worskshops for Intelligent Continuous SecurityValue Stream Mapping Worskshops for Intelligent Continuous Security
Value Stream Mapping Worskshops for Intelligent Continuous Security
Marc Hornbeek
 
Data Structures_Searching and Sorting.pptx
Data Structures_Searching and Sorting.pptxData Structures_Searching and Sorting.pptx
Data Structures_Searching and Sorting.pptx
RushaliDeshmukh2
 
BCS401 ADA Second IA Test Question Bank.pdf
BCS401 ADA Second IA Test Question Bank.pdfBCS401 ADA Second IA Test Question Bank.pdf
BCS401 ADA Second IA Test Question Bank.pdf
VENKATESHBHAT25
 
fluke dealers in bangalore..............
fluke dealers in bangalore..............fluke dealers in bangalore..............
fluke dealers in bangalore..............
Haresh Vaswani
 
Fourth Semester BE CSE BCS401 ADA Module 3 PPT.pptx
Fourth Semester BE CSE BCS401 ADA Module 3 PPT.pptxFourth Semester BE CSE BCS401 ADA Module 3 PPT.pptx
Fourth Semester BE CSE BCS401 ADA Module 3 PPT.pptx
VENKATESHBHAT25
 
DT REPORT by Tech titan GROUP to introduce the subject design Thinking
DT REPORT by Tech titan GROUP to introduce the subject design ThinkingDT REPORT by Tech titan GROUP to introduce the subject design Thinking
DT REPORT by Tech titan GROUP to introduce the subject design Thinking
DhruvChotaliya2
 
π0.5: a Vision-Language-Action Model with Open-World Generalization
π0.5: a Vision-Language-Action Model with Open-World Generalizationπ0.5: a Vision-Language-Action Model with Open-World Generalization
π0.5: a Vision-Language-Action Model with Open-World Generalization
NABLAS株式会社
 
Degree_of_Automation.pdf for Instrumentation and industrial specialist
Degree_of_Automation.pdf for  Instrumentation  and industrial specialistDegree_of_Automation.pdf for  Instrumentation  and industrial specialist
Degree_of_Automation.pdf for Instrumentation and industrial specialist
shreyabhosale19
 
Machine learning project on employee attrition detection using (2).pptx
Machine learning project on employee attrition detection using (2).pptxMachine learning project on employee attrition detection using (2).pptx
Machine learning project on employee attrition detection using (2).pptx
rajeswari89780
 
ELectronics Boards & Product Testing_Shiju.pdf
ELectronics Boards & Product Testing_Shiju.pdfELectronics Boards & Product Testing_Shiju.pdf
ELectronics Boards & Product Testing_Shiju.pdf
Shiju Jacob
 
Avnet Silica's PCIM 2025 Highlights Flyer
Avnet Silica's PCIM 2025 Highlights FlyerAvnet Silica's PCIM 2025 Highlights Flyer
Avnet Silica's PCIM 2025 Highlights Flyer
WillDavies22
 
Basic Principles for Electronics Students
Basic Principles for Electronics StudentsBasic Principles for Electronics Students
Basic Principles for Electronics Students
cbdbizdev04
 
Smart_Storage_Systems_Production_Engineering.pptx
Smart_Storage_Systems_Production_Engineering.pptxSmart_Storage_Systems_Production_Engineering.pptx
Smart_Storage_Systems_Production_Engineering.pptx
rushikeshnavghare94
 
MAQUINARIA MINAS CEMA 6th Edition (1).pdf
MAQUINARIA MINAS CEMA 6th Edition (1).pdfMAQUINARIA MINAS CEMA 6th Edition (1).pdf
MAQUINARIA MINAS CEMA 6th Edition (1).pdf
ssuser562df4
 
How to Make Material Space Qu___ (1).pptx
How to Make Material Space Qu___ (1).pptxHow to Make Material Space Qu___ (1).pptx
How to Make Material Space Qu___ (1).pptx
engaash9
 
Mathematical foundation machine learning.pdf
Mathematical foundation machine learning.pdfMathematical foundation machine learning.pdf
Mathematical foundation machine learning.pdf
TalhaShahid49
 
Reagent dosing (Bredel) presentation.pptx
Reagent dosing (Bredel) presentation.pptxReagent dosing (Bredel) presentation.pptx
Reagent dosing (Bredel) presentation.pptx
AlejandroOdio
 
vlsi digital circuits full power point presentation
vlsi digital circuits full power point presentationvlsi digital circuits full power point presentation
vlsi digital circuits full power point presentation
DrSunitaPatilUgaleKK
 
Dust Suppressants: A Sustainable Approach to Dust Pollution Control
Dust Suppressants: A Sustainable Approach to Dust Pollution ControlDust Suppressants: A Sustainable Approach to Dust Pollution Control
Dust Suppressants: A Sustainable Approach to Dust Pollution Control
Janapriya Roy
 
Artificial Intelligence (AI) basics.pptx
Artificial Intelligence (AI) basics.pptxArtificial Intelligence (AI) basics.pptx
Artificial Intelligence (AI) basics.pptx
aditichinar
 
Value Stream Mapping Worskshops for Intelligent Continuous Security
Value Stream Mapping Worskshops for Intelligent Continuous SecurityValue Stream Mapping Worskshops for Intelligent Continuous Security
Value Stream Mapping Worskshops for Intelligent Continuous Security
Marc Hornbeek
 
Data Structures_Searching and Sorting.pptx
Data Structures_Searching and Sorting.pptxData Structures_Searching and Sorting.pptx
Data Structures_Searching and Sorting.pptx
RushaliDeshmukh2
 
BCS401 ADA Second IA Test Question Bank.pdf
BCS401 ADA Second IA Test Question Bank.pdfBCS401 ADA Second IA Test Question Bank.pdf
BCS401 ADA Second IA Test Question Bank.pdf
VENKATESHBHAT25
 
fluke dealers in bangalore..............
fluke dealers in bangalore..............fluke dealers in bangalore..............
fluke dealers in bangalore..............
Haresh Vaswani
 
Fourth Semester BE CSE BCS401 ADA Module 3 PPT.pptx
Fourth Semester BE CSE BCS401 ADA Module 3 PPT.pptxFourth Semester BE CSE BCS401 ADA Module 3 PPT.pptx
Fourth Semester BE CSE BCS401 ADA Module 3 PPT.pptx
VENKATESHBHAT25
 
DT REPORT by Tech titan GROUP to introduce the subject design Thinking
DT REPORT by Tech titan GROUP to introduce the subject design ThinkingDT REPORT by Tech titan GROUP to introduce the subject design Thinking
DT REPORT by Tech titan GROUP to introduce the subject design Thinking
DhruvChotaliya2
 
π0.5: a Vision-Language-Action Model with Open-World Generalization
π0.5: a Vision-Language-Action Model with Open-World Generalizationπ0.5: a Vision-Language-Action Model with Open-World Generalization
π0.5: a Vision-Language-Action Model with Open-World Generalization
NABLAS株式会社
 
Degree_of_Automation.pdf for Instrumentation and industrial specialist
Degree_of_Automation.pdf for  Instrumentation  and industrial specialistDegree_of_Automation.pdf for  Instrumentation  and industrial specialist
Degree_of_Automation.pdf for Instrumentation and industrial specialist
shreyabhosale19
 
Ad

Intro to big data analytics using microsoft machine learning server with spark

  • 2. Disclaimer: This is NOT an R , Python , Hadoop or Spark Overview
  • 3. • From its inception, R was designed to use only a single thread (processor) at a time. Even today, R works that way unless linked with multi-threaded BLAS/LAPACK libraries. • Microsoft R Open : automatically use all available cores and processors to significantly reduce computation times — without the need to change a line of your R code.
  • 4. broad range of scalable and distributed R & Python functions
  • 5. Platforms & Data Tools Languages Algorithms Data Sources Rattle Mrsdeploy RESTful API deployment Real-Time Scoring Visualization Tool Integration .csv Microsoft .XDF In-database deployment Operationalization Distributed Parallelized Algorithms: •RevoScaleR library •MicrosoftML library •Custom parallelization frameworks Open source R algorithms & visualizations: •CRAN •bioconductor Plus: •Deep Learning •Pretrained Models •Prebuilt Featurizers ODBC/JDBC
  • 6. • Small data many models • Hybrid • Large scale machine learning • Batch training and scoring • Model deployment for LOB applications
  • 8. Machine Learning and Analytics Big Data StoresInformation Management Big Data as a cornerstone for Azure Data Services Transform data into intelligent actions and predictions Action People Automated Systems Apps Web Mobile Bots Event Hubs Data Catalog Data Factory HDInsight (Hadoop and Spark) Stream Analytics Intelligence Data Lake Analytics Machine Learning SQL Data Warehouse Data Lake Store Data Sources Apps Sensors and devices Data Cosmos DB Intelligence Dashboards & Visualizations Cortana Bot Service Cognitive Services Power BI Azure Analysis Services SQL Database Batch AI Blob Store Databricks
  • 9. • • • Support: • Inadequacy of Community Support • Application Governance: • Commonly prohibited in production systems without commercial vendor
  • 11. • Support for 1.6, 2.0, 2.1 and 2.2 • Support for all three distributions of Hadoop on all three flavors of Linux • Inherits high-performance big-data processing characteristics from Spark • Easily integrate with different data sources e.g. HIVE, Parquet , Orc, XDF, csv, etc… • E2E framework for running parallel workloads • Interoperability with Sparklyr and H2O • Best in-class operationalization • Guaranteed backward compatibility
  • 12. R User Workstation MLS Server for Hadoop Data Frames YARN Resource Management Spark Executor Worker Task Spark Executor Worker Task Spark Executor Worker Task .csv, xdf, hive, parquet, ORC ScaleR Master Task Finalizer Initiator Edge Node Spark Driver Ver. 1.6 or 2.0 RStudio Desktop/Server Microsoft R Server
  • 13. Data Frames Worker Task Worker Task Worker Task ScaleR Master Task Finalizer Initiator Remote Execution: ssh Web Services MRSDeplo y R Tools (VSCode, Vstudio,Rstudio, …) BI Tools & Applications Jupyter Notebooks Thin Client IDEs https:// https:// Edge Node RStudio Rstudio Server
  • 14. Built-in remote execute functions in R Client/R Server Tools to reconcile local and remote Execute .R script or interactive R commands Return results to local Generate working snapshots for resume and reuse IDE agnostic MLS Server configured to (Support Window Server, Linux Server, Hadoop ) Remote Execute R Scripts  Execute R Scripts  Snapshot remote env.  Logout remote server  Login remote server  Generate Diff report  Reconcile Environment
  • 15. DEMO : Interacting with MLS Server Rstudio Desktop (Remote Context), Rstudio Server
  • 16. ### SETUP HADOOP ENVIRONMENT VARIABLES ### myHadoopCluster <- RxSpark() ### HADOOP COMPUTE CONTEXT ### rxSetComputeContext(myHadoopCluster) ### CREATE HDFS, DIRECTORY AND FILE OBJECTS ### hdfsFS <- RxHdfsFileSystem() ### ANALYTICAL PROCESSING ### ### Statistical Summary of the data rxSummary(~ArrDelay+DayOfWeek, data= AirlineDataSet, reportProgress=1) ### CrossTab the data rxCrossTabs(ArrDelay ~ DayOfWeek, data= AirlineDataSet, means=T) ### Linear Model and plot hdfsXdfArrLateLinMod <- rxLinMod(ArrDelay ~ DayOfWeek + 0 , data = AirlineDataSet) plot(hdfsXdfArrLateLinMod$coefficients) ### LOCAL COMPUTE CONTEXT ### rxSetComputeContext("local") ### CREATE, REFERENCE FILE OBJECTS ### AirlineDataSet <- RxTextData(file.path(dataRead,"AirlineDemoSmall.csv") ) Local Parallel processing – Linux or Windows In – Hadoop (Spark / MR) or SqlServer ScaleR models can be deployed from a server or edge node to run in Hadoop without any functional R model re-coding for map-reduce Functional model R script – does not need to change to run in Hadoop Compute context R script – sets where the model will run
  • 17. • Defines where the processing happens • Current set compute context determines processing location • Write Once Deploy Anywhere (WODA) by changing compute context
  • 21. rxSetComputeContext(RxSpark(…)) mydata<-RxTextData("/data/admin.csv", fileSystem = RxHdfsFileSystem()) mylogIt <- rxLogit(admin~ gre + gpa + rank, data = mydata) Using a Spark compute context Keep other code unchanged
  • 24. • Scale via parallel & distributed computation • Scales via in-cluster & in-database execution • Escapes R’s memory limitations • Speed analysis by reducing data movement • Stop security risks with in-database & in- Hadoop • Deploys R to multiple platforms f(x) Task Task Task
  • 25. RevoScaleR Algorithms & Functions Task Task Task Finalizer Initiator Data Larger than RAM Internal Flow: 1. Algorithm begins initiator process 2. Initiator distributes work to nodes 3. Finalizer collects results 4. Finalizer iterates or continues 5. Finalizer evaluates final model 6. Returns single model to calling script … load a large dataset Model_obj <- rxLinMod(…) … run a RevoScaleR algorithm Typical Characteristics  One call, one answer  Arbitrarily large data sets  Arbitrarily large worker task set  Mathematically the same as single-threaded  Platform independent  Most are written in C++ for speed
  • 26. Remote RevoScaleR Algorithm Internal Flow: 1. Algorithm on local checks compute context 2. If set remote, packages and ships request 3. Local script blocks (by default) awaiting response 4. Remote unpacks and executes in parallel 5. Remote returns results to local interface 6. Local interface returns results to script … set compute context to Spark rxSetComputeContext(RxSpark(…)) Model_obj <- rxLinMod(…) … run a RevoScaleR algorithm Big data Remote Execution:  Executes RevoScaleR algos on remote data & CPUs  “rxSetComputeContext” redirects to remote  Algorithms in RevoScaleR library redirect as set  Results are returned to script as though local Local RevoScaleR Algorithm Algorithm Interface Remote Local Task Task Task Finalizer Initiator SQL Server, Hadoop, Hadoop MapReduce, HDInsight RevoScaleR Remote Execution
  • 29. RevoScaleR Spark + Sparklyr + H20 * Interoperability with Sparklyr and H2O * Best in-class operationalization * Guaranteed backward compatibility DEMO : Data manipulation using sparklyr, Interoperability between sparklyr and RevoScaleR
  • 30. IntelligentApps • Directly integrate intelligence into apps. • Streamlined real-time scoring: • Web Services operationalization Platforms and Data Tools Language Algorithms Data Sources Operationalization Mrsdeploy RESTful API deployment Real-Time Scoring Visualization Tool Integration In-database deployment
  • 32. • Seamless integration with authentication solution: LDAP/AD/AAD • Secure connection: HTTPS encrypted by TLS 1.2/SSL • Compliance with Microsoft Security Development Lifecycle R Client
  • 33. Load Balancer • Server level HA: Introduce multiple Web Nodes for Active-Active backup / recovery, via load balancer • Data Store HA: leverage Enterprise grade DB, SQL Server and Postgres’ HA capabilities
  • 34. • Easily create web services from R scripts & models Build the model first Deploy as a web service instantly
  • 35. # Run the following code in R swagger <- api$swagger() cat(swagger, file = "swagger.json", append = FALSE) Generate Swagger Docs for Web Services Popular Swagger Tools: AutoRest or Code Generator AutoRest.exe -CodeGenerator CSharp -Modeler Swagger - Input swagger.json - Namespace Mynamespace Run Swagger tools to generate code Write a few code to consume the service Data Scientist DeveloperDeveloper
  • 39. Choice • Open Source: Microsoft R Open: • Microsoft innovations: Microsoft R Extensions: Distributed Parallelized Algorithms: •RevoScaleR library •MicrosoftML library •Custom parallelization frameworks Open source R algorithms & visualizations: •CRAN •bioconductor Plus: •Deep Learning •Pretrained Models •Prebuilt Featurizers Platforms and Data Tools Language Algorithms Data Sources Operationalization
  • 40. Variable Selection Stepwise Regression Simulation Simulation (e.g. Monte Carlo) Parallel Random Number Generation Cluster Analysis K-Means Classification Decision Trees Decision Forests Gradient Boosted Decision Trees Naïve Bayes Custom Parallel rxDataStep rxExec rxExecBy – Many-Models PEMA-R API Custom Algorithms Data Step Data import – Delimited, Fixed, SAS, SPSS, ODBC Variable creation & transformation Recode variables Factor variables Missing value handling Sort, Merge, Split Aggregate by category (means, sums) Descriptive Statistics Min / Max, Mean, Median (approx.) Quantiles (approx.) Standard Deviation Variance Correlation Covariance Sum of Squares (cross product matrix for set variables) Pairwise Cross tabs Risk Ratio & Odds Ratio Cross-Tabulation of Data (standard tables & long form) Marginal Summaries of Cross Tabulations Statistical Tests Chi Square Test Kendall Rank Correlation Fisher’s Exact Test Student’s t-Test Sampling Subsample (observations & variables) Random Sampling Predictive Models Sum of Squares (cross product matrix for set variables) Quantiles (approx.) Generalized Linear Models (GLM) exponential family distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. Standard link functions: cauchit, identity, log, logit, probit. User defined distributions & link functions. Covariance & Correlation Matrices Logistic Regression Linear regression Predictions/scoring for models Residuals for all models
  • 41. Learner Strength Learning tasks supported Sample Applications Deep Neural Network Supports custom, multi-layer network topology with filtered, convolutional, and pooling bundles Binary classification Multi-class classification Regression Bing Ads Click Prediction ($50M per year revenue gain); Image Classification Logistic regression L1, L2 regularization Binary classification Multi-class classification. Classifying user feedback One-class SVM Easy to train learner for anomaly detection Anomaly Detection Fraud detection Fast Tree Boosted decision tree. Similar to XGBoost. Supports upto 1B features! Binary classification Regression One of the most popular and best performing learners inside Microsoft. Fast Forest state-of-the-art tree ensembles (Random Forest) Binary classification Regression Churn Prediction Fast Linear (SDCA) Speed, scalability and supports L1,L2 regularization Binary classification, Regression Outlook used for email spam filtering Transform Strength Description Applications Text Battle tested, large language support, performant (Bing, Office) Performs natural language processing of free text into numerical representation. Support ticket classification, Sentiment analysis Categorical / Categorical hash Ease of use; 1 line of code to set Converts categories into numerical data. Ad Click Prediction Feature Selection Ease of use Selects a subset of features to speed up training time. Sentiment analysis, Ad Click Prediction
  • 42. Pretrained Image Featurization  FeaturizeImage() – Used to identify parts of images – People, things, animals, etc.  FeaturizeText() – Returns Ngram digest & counts from many partitions of text data Text Featurization Featurizer Featurizer Ngrams (phrases) counts Text Data Sets Featurizer ngram ngram ngram Image Data Sets Image contents Featurizer Featurizer Image contents found Featurizer  GetSentiment() – Pretrained to return sentiment score (0-1) – English only for now Pretrained Sentiment Analysis Featurizer Featurizer getSentime nt() Text Data Sets Featurizer Sentiment Score
  • 43. DistributedModeling • “Many Models” on Partitioned Data • Parallelized ensemble modeling • New rxExecBy framework: • Superior to Spark gapply: Platforms and Data Tools Language Algorithms Data Sources Operationalization
  • 44.  rxEnsemble: – Returns ensembled model combining multiple types – Ensembling settings balance speed & accuracy Many Small Models Ensemble Learning Model 1 Model 2 rxEnsemble Single or Distributed Data Sets  ManyModels: – Used to run model on each of many partitions. – Returns one model trained per cohort (partition) of data. P3 Model P1 P2 P1 Model P2 Returns a set of Models Data Partitioned by Cohort Model P3 Model P1 Model P2 Model P3 Ensemble Model
  • 45. Scale • RevoScaleR • Microsoft Machine Learning Library • Create custom parallel algorithms and functions: Distributed Parallelized Algorithms: •RevoScaleR library •MicrosoftML library •Custom parallelization frameworks Open source R algorithms & visualizations: •CRAN •bioconductor Plus: •Deep Learning •Pretrained Models •Prebuilt Featurizers Platforms and Data Tools Language Algorithms Data Sources Operationalization
  • 47. SQLServer • Fast modeling platform with operationalization in one. • Minimum data movement speeds & secures • Direct T-SQL Integration • Real-Time Scoring Support • SQL Server security & administration • Rapid deployment of R models as stored procedures. • Python support. Platforms and Data Tools Language Algorithms Data Sources Operationalization Distributed Parallelized Algorithms: •RevoScaleR library •MicrosoftML library •Custom parallelization frameworks Open source R algorithms & visualizations: •CRAN •bioconductor Plus: •Deep Learning •Pretrained Models •Prebuilt Featurizers .csv Microsoft .XDF ODBC/JDBC
  • 49. (SQL Server 2017) • Python support • Microsoft Machine Learning package included • Process multiple related models in parallel with the rxExecBy function • Create a shared script library with R script package management • Native scoring with T-SQL PREDICT • In-place upgrade of R components
  • 50. Spark • Distribute RevoScaleR algorithms using Spark • Spark 1.6 or 2.0 Compatible • Load Spark Dataframes from: • Interop with SparkETL, SparkSQL • Support for HDInsight, Hortonworks, Cloudera & MapR Hadoop Platforms and Data Tools Language Algorithms Data Sources Operationalization Distributed Parallelized Algorithms: •RevoScaleR library •MicrosoftML library •Custom parallelization frameworks Open source R algorithms & visualizations: •CRAN •bioconductor Plus: •Deep Learning •Pretrained Models •Prebuilt Featurizers .csv Microsoft .XDF ODBC/JDBC
  • 51. Configuration: • HDI cluster size: 100 nodes - Edge node: D14 V2 (16 cores, 112GB) - Worker Nodes: D12 (4 cores, 28GB) • Dataset: NYC Taxi dataset, filtered, transformed, and duplicated • Number of columns: 19 • Format: CSV • fs.azure.selfthrottling.read.factor=1 -200 300 800 1300 1800 0 1 2 3 4 5 6 7 8 9 10 11 12 13 ElapsedTime(seconds) Billions of rows rxLogit on a 100 node HDInsight cluster 2.2 TB
  • 53. IntelligentApps • Directly integrate intelligence into apps. • Streamlined real-time scoring: • Web Services operationalization Platforms and Data Tools Language Algorithms Data Sources Operationalization Mrsdeploy RESTful API deployment Real-Time Scoring Visualization Tool Integration In-database deployment
  • 54. Built-in remote execute functions in R Client/R Server Tools to reconcile local and remote Execute .R script or interactive R commands Return results to local Generate working snapshots for resume and reuse IDE agnostic R Server configured to (Support Window Server, Linux Server, Hadoop ) Remote Execute R Scripts  Execute R Scripts  Snapshot remote env.  Logout remote server  Login remote server  Generate Diff report  Reconcile Environment
  • 55. Agility • Visual Studio • Support for open source R Studio GUI • Support for JupyterR • Microsoft contributing to Rattle Rattle Platforms and Data Tools Language Algorithms Data Sources Operationalization
  • 58. recognized leader broadest R capabilities performance and scale architectural flexibility Rapid deployment Harnesses the hybrid cloud Scales R for big data workloads Blends best of open source and Microsoft technologies platform choice
  • 61. In-Database analytics with SQL Server In SQL Server 2016, Microsoft launched two server platforms for integrating the popular open source R language with business applications: • SQL Server R Services (In-Database), for integration with SQL Server • Machine Learning Server, for enterprise-level R deployments on Windows and Linux servers In SQL Server 2017, the name has been changed to reflect support for the popular Python language: • SQL Server Machine Learning Services (In-Database) supports both R and Python for in- database analytics • Microsoft Machine Learning Server supports R and Python deployments on Windows servers—expansion to other supported platforms is planned for late 2017
  • 62. Capability • Extensible in-database analytics, integrated with R, exposed through T-SQL • Centralized enterprise library for analytic models Benefits SQL Server Analytical engines Integrate with R/Python Data management layer Relational data Use T-SQL interface Stream data in-memory Analytics library Share and collaborate Manage and deploy R Data scientists Business analysts Publish algorithms, interact directly with data Analyze through T-SQL, tools, and vetted algorithms DBAs Manage storage and analytics together Machine Learning Services
  • 63. SQL Server 2017 setup Install Machine Learning Services (In-Database) Consent to install Microsoft R Open/Python Optional: Install R packages on SQL Server 2017 machine Database configuration Enable R language extension in database Configure path for RRO runtime in database Grant EXECUTE EXTERNAL SCRIPT permission to users CREATE EXTERNAL EXTENSION [R] USING SYSTEM LAUNCHER WITH (RUNTIME_PATH = 'c:revolutionbin‘) GRANT EXECUTE SCRIPT ON EXTERNAL EXTENSION::R TO DataScientistsRole; /* User-defined role / users */
  • 64. ML runtime usage Resource governance via resource pool Monitoring via DMVs Troubleshooting via XEvents/ DMVs CREATE RESOURCE POOL ML_runtimes FOR EXTERNAL EXTENSION WITH MAX_CPU_PERCENT = 20, MAX_MEMORY_PERCENT = 10; select * from sys.dm_resource_governor_resouce_pools where name = ‘ML_runtimes';
  • 65. • Original R script: • IrisPredict <- function(data, model){ • library(e1071) • predicted_species <- predict(model, data) • return(predicted_species) • } • • • library(RODBC) • conn <- odbcConnect("MySqlAzure", uid = myUser, pwd = myPassword); • Iris_data <-sqlFetch(conn, "Iris_Data"); • Iris_model <-sqlQuery(conn, "select model from my_iris_model"); • IrisPredict (Iris_data, model); • Calling R script from SQL Server: • /* Input table schema */ • create table Iris_Data (name varchar(100), length int, width int); • /* Model table schema */ • create table my_iris_model (model varbinary(max)); • • declare @iris_model varbinary(max) = (select model from my_iris_model); • exec sp_execute_external_script • @language = 'R' • , @script = ' • IrisPredict <- function(data, model){ • library(e1071) • predicted_species <- predict(model, data) • return(predicted_species) • } • IrisPredict(input_data_1, model); • ' • , @parallel = default • , @input_data_1 = N'select * from Iris_Data' • , @params = N'@model varbinary(max)' • , @model = @iris_model • with result sets ((name varchar(100), length int, width int • , species varchar(30))); • Values highlighted in yellow are SQL queries embedded in the original R script • Values highlighted in aqua are R variables that bind to SQL variables by name
  • 66. launchpad.exe sp_execute_external_script sqlservr.exe Named pipe SQLOS XEvent MSSQLSERVER Service MSSQLLAUNCHPAD Service (one per SQL Server instance) What and how to launch “launcher” Bxlserver.exe sqlsatellite.dll Bxlserver.exe sqlsatellite.dll Windows Satellite Process sqlsatellite.dll Run query
  • 67. • More efficient than standalone clients • Data does not all have to fit in memory • Reduced data transmission over the network • Most R Open (and Python R) functions are single threaded • We can stream data in parallel and batches from SQL Server to/from script • Use the power of SQL Server and ML to develop, train, and operationalize • SQL Server compute context (remote compute context) • T-SQL queries • Memory-optimized tables • Columnstore indexes • Data compression • Parallel query execution • Stored procedures
  • 68. Reduced surface area and isolation “external scripts enabled” is required Script execution outside of SQL Server process space Script execution requires explicit permission sp_execute_external_script requires EXECUTE ANY EXTERNAL SCRIPT for non-admins SQL Server login/user required and db/table access Satellite processes have limited privileges Satellite processes run under low privileged, local user accounts in the SQLRUserGroup Each execution is isolated — different users with different accounts Windows firewall rules block outbound traffic
  • 69. MicrosoftML is a package for Machine Learning Server, Microsoft R Client, and SQL Server Machine Learning Services that adds state-of-the-art data transforms, machine learning algorithms, and pretrained models to Microsoft R functionality. • Data transforms helps you to compose, in a pipeline, a custom set of transforms that are applied to your data before training or testing. The primary purpose of these transforms is to allow you to featurize your data. • Machine learning algorithms enable you to tackle common machine learning tasks such as classification, regression and anomaly detection. You run these high-performance functions locally on Windows or Linux machines or on Azure HDInsight (Hadoop/Spark) clusters. • Pretrained models for sentiment analysis and image featurization can also be installed and deployed with the MicrosoftML package.
  • 71. • Run R or Python inside SQL Server 2017 with ML Services • Existing SQL 2016 clients can R using SQL R Services SQL Server 2017 Run R engine from within the Query Processor
  • 72. SQL Server 2016/17 Run R From within the Query Processor Move BIG Work to the Data T-SQL Apps T-SQL Script
  • 73. T-SQL Stored ProcedureSQL Server 2016/17 Enable smart non-R apps BI & Reporting; Web apps T-SQL Script R Engine R script
  • 74. Large Data Sets in Chunks Remote Execution Context Results Parallel Worker Tasks Parallel Algorithm Iterate/ Sequence SQL ApplicationsT-SQL + SQL Server 2016/17
  • 77. Use familiar T-SQL stored procedures to invoke R scripts from your application
  • 78. • Turn R analytics  Web services in one line of code; • Swagger-based REST APIs, easy to consume, with any programming languages, including R! • Deploying web service server to any platform: Windows, SQL, Linux/Hadoop • On-prem or in cloud • Fast scoring, real time and batch • Scaling to a grid for powerful computing with load balancing • Diagnostic and capacity evaluation tools • Enterprise authentication: AD/LDAP or AAD • Secure connection: HTTPS with SSL/TLS 1.2 • Enterprise grade high availability Instant Deployment Deploy to Anywhere Fast and Scalable Secure and Reliable
  • 79. Data Scientist Developer Easy Integration Easy Deployment Easy Setup  In-cloud or on-prem  Adding nodes to scale  High availability & load balancing  Remote execution server Machine Learning Server configured for operationalizing R analytics Microsoft R Client (mrsdeploy package) Easy Consumption publishServiceMicrosoft R Client (mrsdeploy package) Data Scientist
  • 80. • Seamless integration with authentication solution: LDAP/AD/AAD • Secure connection: HTTPS encrypted by TLS 1.2/SSL • Compliance with Microsoft Security Development Lifecycle R Client
  • 81. Load Balancer • Server level HA: Introduce multiple Web Nodes for Active-Active backup / recovery, via load balancer • Data Store HA: leverage Enterprise grade DB, SQL Server and Postgres’ HA capabilities
  • 82. • Easily create web services from R scripts & models Build the model first Deploy as a web service instantly
  • 83. Function Description publishService Publish a predictive function as a Web Service deleteService Delete a Web Service getService Get a Web Service ListServices List the different published web services serviceOption Retrieve, set, and list the different service options updateService Updates a Web Service
  • 84. # Run the following code in R swagger <- api$swagger() cat(swagger, file = "swagger.json", append = FALSE) Generate Swagger Docs for Web Services Popular Swagger Tools: AutoRest or Code Generator AutoRest.exe -CodeGenerator CSharp -Modeler Swagger - Input swagger.json - Namespace Mynamespace Run Swagger tools to generate code Write a few code to consume the service Data Scientist DeveloperDeveloper
  • 86. • Easily scale up a single server to a grid to handle more concurrent requests • Load balancing cross compute nodes • Scale overall performance with shared pools of warmed up R shells. R Client
  • 87. Snapshot Functions createSnapshot Create a snapshot of the remote session (workspace and working directory) loadSnapshot Load a snapshot from the server into the remote session (workspace and working directory) listSnapshots Get a list of snapshots for the current user downloadSnapshot Download a snapshot from the server deleteSnapshot Delete a snapshot from the server Remote Objects Management listRemoteFiles Get a list of files in the working directory of the remote session deleteRemoteFile Delete a file from the working directory of the remote R session getRemoteFile Copy a file from the working directory of the remote R session putLocalFile Copy a file from the local machine to the working directory of the remote R session getRemoteObject Get an object from the remote R session putLocalObject Put an object from the local R session and load it into the remote R session getRemoteWorkspace Take all objects from the remote R session and load them into the local R session putLocalWorkspace Take all objects from the local R session and load them into the remote R session Remote Connection remoteLogin Remote login to the R Server with AD or admin credentials remoteLoginAAD Remote login to R Server server using Azure AD remoteLogout Logout of the remote session on the DeployR Server. Remote Execution remoteExecute Remote execution of either R code or an R script remoteScript Wrapper function for remote script execution diffLocalRemote Generate a 'diff' report between local and remote pause Pause remote connection and back to local resume Return the user to the 'REMOTE >' command prompt