Introduction to R and integration of SparkR and Spark’s MLlib
Dang Trung Kien
About me 
• Statistics undergraduate 
• trungkiendang@hotmail.com
R
What is R? 
• Statistical Programming Language 
• Open source 
• > 6000 available packages 
• Widely used in academia and research
Companies that use R 
• Facebook 
• Google 
• Foursquare 
• Ford 
• Bank of America 
• ANZ 
• …
Data types 
• Vector 
• Matrix 
• List 
• Data frame
Vector 
• c(1, 2, 3, 4) 
## [1] 1 2 3 4 
• 1:4 
## [1] 1 2 3 4 
• c("a", "b", "c") 
## [1] "a" "b" "c"
• c(T, F, T) 
## [1] TRUE FALSE TRUE
Matrix 
• matrix(c(1, 2, 3, 4), ncol=2) 
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
• matrix(c(1, 2, 3, 4), ncol=2, byrow=T)
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
List 
• list(12, "twelve")
## [[1]] 
## [1] 12 
## 
## [[2]] 
## [1] "twelve"
Data frame 
name <- c("A", "B", "C")
age <- c(30, 17, 42)
male <- c(T, F, F)
data.frame(name, age, male)
##   name age  male
## 1    A  30  TRUE
## 2    B  17 FALSE
## 3    C  42 FALSE
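As a small sketch of how the four types differ in practice, the snippet below builds one of each and indexes it. The variable names (v, m, l, df) are illustrative and do not come from the slides.

```r
# Illustrative names and values; not part of the original slides.
v  <- c(10, 20, 30, 40)
m  <- matrix(c(1, 2, 3, 4), ncol = 2)    # filled column by column
l  <- list(12, "twelve")                 # elements may have different types
df <- data.frame(name = c("A", "B", "C"),
                 age  = c(30, 17, 42),
                 male = c(TRUE, FALSE, FALSE))

v[2]               # second element of the vector: 20
m[1, 2]            # row 1, column 2 of the matrix: 3
l[[2]]             # second element of the list: "twelve"
df$age             # one column of the data frame, as a vector
df[df$age > 18, ]  # only the rows where age exceeds 18
```

Single brackets keep the container type, while `[[ ]]` and `$` extract the element itself, which is why the list and data frame lines use them.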
x <- 1:100
y <- 1:100 + runif(100, 0, 20)  # linear trend plus uniform noise
m <- lm(y ~ x)                  # fit a simple linear regression
plot(y ~ x)
abline(m)                       # draw the fitted regression line
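To look at the fit numerically rather than on a plot, a sketch along these lines may help. The seed and the new x values are assumptions added for reproducibility and illustration. Because the noise is uniform on [0, 20], the intercept should come out near 10 and the slope near 1.

```r
set.seed(42)                      # assumed seed, only for reproducibility
x <- 1:100
y <- 1:100 + runif(100, 0, 20)    # slope 1, noise uniform on [0, 20]
m <- lm(y ~ x)

coef(m)                           # estimated intercept and slope
summary(m)$r.squared              # proportion of variance explained
# Predictions for x values that were not in the training data
predict(m, newdata = data.frame(x = c(101, 150)))
```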
But… 
• R is single-threaded 
• Can only process data sets that fit on a single machine
SparkR
SparkR 
• An R package that provides a lightweight front end for using Apache Spark from R
• Exposes Spark’s RDD API as distributed lists in R
• Allows users to run jobs interactively from the R shell on a cluster
Spark + R
• count, countByKey, countByValue, flatMap, map (lapply), …
• broadcast, includePackage, …
• Filter, reduce, reduceByKey, distinct, union, …
Data flow 
[Diagram. Local: R Spark Context, linked through JNI to the Java Spark Context. Workers (three shown): Spark Executor and R.]
Word count 
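# sc is the Spark context, e.g. created with sparkR.init()
lines <- textFile(sc, "/path/to/file")
# Split each line into words
words <- flatMap(lines,
                 function(line) {
                   strsplit(line, " ")[[1]]
                 })
# Pair each word with a count of 1
wordCount <- lapply(words, function(word) { list(word, 1L) })
# Sum the counts per word, using 2 partitions
counts <- reduceByKey(wordCount, "+", 2L)
output <- collect(counts)
for (wordcount in output) {
  cat(wordcount[[1]], ": ", wordcount[[2]], "\n")
}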
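The same pipeline can be tried without a cluster. The following is a local, single-machine sketch in base R that mimics the flatMap, lapply and reduceByKey steps. It does not use SparkR, and the input lines are made up for illustration.

```r
# Local analogue in base R; no Spark involved. Input lines are made up.
lines <- c("the quick brown fox", "the lazy dog", "the fox")

# flatMap analogue: split every line into words and flatten the result
words <- unlist(lapply(lines, function(line) strsplit(line, " ")[[1]]))

# map analogue: pair each word with the count 1
pairs <- lapply(words, function(word) list(word, 1L))

# reduceByKey analogue: group the counts by word and sum each group
keys   <- vapply(pairs, function(p) p[[1]], character(1))
values <- vapply(pairs, function(p) p[[2]], integer(1))
counts <- vapply(split(values, keys), sum, integer(1))

for (word in names(counts)) {
  cat(word, ": ", counts[[word]], "\n")
}
```

On a cluster, the grouping step is what gets distributed: pairs with the same key are shuffled to the same partition before being summed.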
SparkR and Spark’s MLlib
Machine Learning 
• Arthur Samuel (1959): "Field of study that gives computers the ability to learn without being explicitly programmed."
Machine Learning 
• Supervised: labels and features; learns a mapping of features to labels; estimates a concept (model) that is closest to the true mapping
• Unsupervised: no labels; clustering of data
Machine Learning 
• Supervised: Naive Bayes, nearest neighbour, decision tree, linear regression, support vector machine, …
• Unsupervised: K-means, DBSCAN, one-class SVM, …
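As an illustration of the unsupervised case, here is a minimal K-means sketch using `kmeans` from base R's stats package. The data, the seed and the choice of two clusters are assumptions made for the example.

```r
set.seed(1)   # assumed seed, only for reproducibility

# Two made-up, well-separated groups of 20 two-dimensional points each
pts <- rbind(matrix(rnorm(40, mean = 0,  sd = 0.5), ncol = 2),
             matrix(rnorm(40, mean = 10, sd = 0.5), ncol = 2))

# No labels are supplied; kmeans only sees the coordinates
fit <- kmeans(pts, centers = 2, nstart = 5)

table(fit$cluster)   # number of points assigned to each cluster
fit$centers          # estimated cluster centres
```

The cluster numbers themselves are arbitrary; what matters is that points from the same group end up with the same number.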
Supervised 
• Classification: cat or dog?
• Regression: age?
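The difference between the two tasks can be sketched in a few lines of base R. Everything here is made up for illustration: the training data, the single feature (weight) and the choice of a 1-nearest-neighbour classifier and a linear model.

```r
# Made-up training data: body weight (kg), a species label and an age (years)
train <- data.frame(weight  = c(3.5, 4.0, 4.5, 20, 25, 30),
                    species = c("cat", "cat", "cat", "dog", "dog", "dog"),
                    age     = c(2, 3, 4, 5, 7, 9))

# Classification: 1-nearest-neighbour returns a discrete label
nearest <- function(w) {
  as.character(train$species[which.min(abs(train$weight - w))])
}
nearest(5)    # closest training weight is 4.5, a cat
nearest(22)   # closest training weight is 20, a dog

# Regression: a linear model returns a continuous value
fit <- lm(age ~ weight, data = train)
predict(fit, newdata = data.frame(weight = 10))
```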
Unsupervised
Naive Bayes 
• Supervised machine learning 
• Classifies texts based on word frequency
Naive Bayes 
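$$P(\text{class} \mid \text{doc}) \propto P(\text{class}) \prod_{\text{word} \in \text{doc}} P(\text{word} \mid \text{class})$$

$$\arg\max_{\text{class}} P(\text{class} \mid \text{doc}) = \arg\max_{\text{class}} P(\text{class}) \prod_{\text{word} \in \text{doc}} P(\text{word} \mid \text{class})$$

$$\arg\max_{\text{class}} \log P(\text{class} \mid \text{doc}) = \arg\max_{\text{class}} \Big[ \log P(\text{class}) + \sum_{\text{word} \in \text{doc}} \log P(\text{word} \mid \text{class}) \Big]$$

$$P(c) = \frac{\text{number of class } c \text{ documents in the training set}}{\text{total number of documents in the training set}}$$

$$P(w \mid c) = \frac{\text{no. of occurrences of word } w \text{ in documents of class } c + 1}{\text{total no. of words in documents of class } c + \text{size of vocabulary}}$$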
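class  "a"  "b"  "c"
1       1    1    0
2       0    2    1

$$P(1) = P(2) = \frac{1}{2}$$

$$P(a \mid 1) = \frac{1+1}{1+1+3} = \frac{2}{5}, \quad P(b \mid 1) = \frac{2}{5}, \quad P(c \mid 1) = \frac{1}{5}$$

$$P(a \mid 2) = \frac{0+1}{2+1+3} = \frac{1}{6}, \quad P(b \mid 2) = \frac{3}{6}, \quad P(c \mid 2) = \frac{2}{6}$$

$$P(1 \mid \text{"a b b"}) \propto \frac{1}{2} \times \frac{2}{5} \times \frac{2}{5} \times \frac{2}{5} = 0.032$$

$$P(2 \mid \text{"a b b"}) \propto \frac{1}{2} \times \frac{1}{6} \times \frac{3}{6} \times \frac{3}{6} \approx 0.021$$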
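The arithmetic can be checked with a minimal base-R sketch, assuming the rows of the table are word counts per class. It applies the add-one smoothing formula literally: class 1 contains 2 words and class 2 contains 3, so with a vocabulary of 3 the denominators are 5 and 6 respectively. The names `counts`, `prior`, `likelihood` and `score` are illustrative.

```r
# Word counts per class, taken from the table in the slide
counts <- rbind("1" = c(a = 1, b = 1, c = 0),
                "2" = c(a = 0, b = 2, c = 1))
prior <- c("1" = 1/2, "2" = 1/2)

# Add-one smoothing: (count + 1) / (words in class + vocabulary size).
# The row totals are recycled down the columns, so each row is divided
# by its own denominator (5 for class 1, 6 for class 2).
likelihood <- (counts + 1) / (rowSums(counts) + ncol(counts))

# Unnormalised score: prior times the product of the word likelihoods
score <- function(doc) {
  words <- strsplit(doc, " ")[[1]]
  sapply(rownames(counts),
         function(k) prior[[k]] * prod(likelihood[k, words]))
}

score("a b b")   # one unnormalised score per class
```

For longer documents the product underflows quickly, which is why the log form, `log(prior) + sum(log(likelihood))`, is normally used instead.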
MLlib 
• Spark’s scalable machine learning library, consisting of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering and dimensionality reduction, as well as underlying optimization primitives.
MLlib and SparkR 
• Access to MLlib from SparkR is currently still in development. Until MLlib is officially integrated into SparkR, the following method can be used to run MLlib from R.
MLlib’s Naive Bayes in R 
• R RDD of list(label, features) → Java RDD of serialised R objects → Scala RDD of LabeledPoint
• Call MLlib through rJava:
J("org.apache.spark.mllib.classification.NaiveBayes", "train", labeled.point.rdd, lambda)
Demo
Thank you for coming!

