SparkR + Zeppelin
Seattle Spark Meetup
Sept 9, 2015
Felix Cheung
Agenda
• R & SparkR
• SparkR DataFrame
• SparkR in Zeppelin
• What’s next
R
• A programming language for statistical computing and graphics
• S – 1975
• S4 - advanced object-oriented features
• R – 1993
• S + lexical scoping
• Interpreted
• Matrix arithmetic
• Comprehensive R Archive Network (CRAN) – 7000+ packages
Fast!
Scalable
Flexible
Statistical!
Interactive
Packages
SparkR
• R language APIs for Spark and Spark SQL
• Exposes Spark functionality in an R-friendly DataFrame API
• Runs as its own REPL (sparkR)
• or as a standard R package imported in tools like RStudio
library(SparkR)
sc <- sparkR.init()
sqlContext <- sparkRSQL.init(sc)
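With the two contexts created, a local R data.frame can be turned into a SparkR DataFrame. The following is a minimal sketch, assuming the Spark 1.5-era API and a Spark install reachable through SPARK_HOME; the `local[2]` master, the app name, and the built-in `faithful` dataset are illustrative choices, not taken from the talk.

```r
library(SparkR)

# Assumption: SPARK_HOME points at a Spark 1.5.x install; master and appName are illustrative
sc <- sparkR.init(master = "local[2]", appName = "sparkr-sketch")
sqlContext <- sparkRSQL.init(sc)

# Convert a local R data.frame into a distributed SparkR DataFrame
df <- createDataFrame(sqlContext, faithful)
printSchema(df)

# head() brings the first rows back to the driver as a plain R data.frame
localRows <- head(df, 3)

sparkR.stop()
```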
History
• Shivaram Venkataraman & Zongheng Yang, AMPLab, UC Berkeley
• RDD APIs in a standalone package (Jan/2014)
• Spark SQL and SchemaRDD -> DataFrame
• Spark 1.4 – first Spark release with SparkR APIs
• Spark 1.5 (today!)
Architecture
Native S4 classes & methods
RBackend
socket
• A set of native S4 classes and methods that live inside a
standard R package
• A backend that passes data structures and method calls to
Spark Scala/JVM
• A collection of “helper” methods written in Scala
Advantages
• R-like syntax extending DataFrame API
• JVM processing with full access to Spark’s DAG capabilities and Catalyst engine, e.g. execution plan optimization, constant-folding, predicate pushdown, and code generation
https://databricks.com/blog/2015/06/09/announcing-sparkr-r-on-spark.html
SparkR DataFrame
• Spark packages
• Data Source API
• Optimizations
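As a rough illustration of the Data Source API and the optimizer, the sketch below reads a JSON file through `read.df`, filters it, and prints the plan Spark produces. It assumes the Spark 1.5-era SparkR API; the temporary file, its two records, and the variable names are made up for the example.

```r
library(SparkR)

# Assumption: SPARK_HOME points at a Spark 1.5.x install
sc <- sparkR.init(master = "local[2]")
sqlContext <- sparkRSQL.init(sc)

# Write a tiny JSON-lines file so the sketch is self-contained (hypothetical data)
path <- tempfile(fileext = ".json")
writeLines(c('{"name":"Smith","age":30}', '{"name":"Lee","age":19}'), path)

# Data Source API: the same call shape applies to other sources such as "parquet"
people <- read.df(sqlContext, path, source = "json")

# The filter is only a logical plan until an action runs
adults <- filter(people, people$age > 20)

# Print the logical and physical plans chosen by Catalyst
explain(adults, extended = TRUE)

# collect() triggers execution and returns a local R data.frame
result <- collect(adults)

sparkR.stop()
```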
SparkR in Zeppelin
Architecture
R
R adaptor
Demo
DIY
• https://github.com/felixcheung/vagrant-projects/tree/master/SparkR-Zeppelin
• Vagrant + VirtualBox
• Install prerequisites: JDK, R, R packages
• Automatically download Spark 1.5.0 release
• Need to build Zeppelin from https://github.com/felixcheung/incubator-zeppelin/tree/r
• Notebook from https://github.com/felixcheung/spark-notebook-examples/blob/master/Zeppelin_notebook/2AZ9584GE/note.json
(extracted from the demo)
Native R
(extracted from the demo)
Native R and dplyr...
Similarly SparkR DataFrame…
(extracted from the demo)
SparkR DataFrame…
What’s new
• Zeppelin - run with provided Spark (SPARK_HOME)
• Spark 1.5.0 release
• SparkR new APIs
SparkR in Spark 1.5.0
Get this today:
• R formula
• Machine learning like GLM
model <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian")
• More R-like
df[df$age %in% c(19, 30), 1:2]
transform(df, newCol = df$col1 / 5, newCol2 = df$col1 * 2)
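A fuller version of the GLM call might look like the sketch below. It assumes the Spark 1.5-era SparkR API and uses R's built-in `iris` dataset as the input; the slide does not say which data `df` holds, so that choice is an assumption.

```r
library(SparkR)

# Assumption: SPARK_HOME points at a Spark 1.5.x install
sc <- sparkR.init(master = "local[2]")
sqlContext <- sparkRSQL.init(sc)

# createDataFrame replaces "." in column names with "_" (Sepal.Length becomes Sepal_Length)
df <- createDataFrame(sqlContext, iris)

# Fit a Gaussian GLM using an R formula
model <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian")
summary(model)

# Score the training data; the result gains a "prediction" column
predictions <- predict(model, newData = df)
predCols <- columns(predictions)
head(select(predictions, "Sepal_Length", "prediction"))

sparkR.stop()
```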
Zeppelin
• Stay tuned! More to come with R/SparkR
• Lots of updates in the upcoming 0.5.x/0.6.0 release
Questions?
https://github.com/felixcheung
linkedin: http://linkd.in/1OeZDb7
blog: http://bit.ly/1E2z6OI
subset
# Columns can be selected using `[[` and `[`
df[[2]] == df[["age"]]
df[,2] == df[,"age"]
df[,c("name", "age")]
# Or to filter rows
df[df$age > 20,]
# DataFrame can be subset on both rows and Columns
df[df$name == "Smith", c(1,2)]
df[df$age %in% c(19, 30), 1:2]
subset(df, df$age %in% c(19, 30), 1:2)
subset(df, df$age %in% c(19), select = c(1,2))
Transform/mutate
newDF <- mutate(df, newCol = df$col1 * 5, newCol2 = df$col1 * 2)
newDF2 <- transform(df, newCol = df$col1 / 5, newCol2 = df$col1 * 2)
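For comparison, the same subsetting and `transform` expressions run unchanged on an ordinary local data.frame, which is the sense in which the SparkR versions are R-like. This is a small base-R sketch with made-up rows; the column values are assumptions for illustration. In native R, `mutate` comes from dplyr, so only `[`, `subset`, and `transform` are shown.

```r
# Hypothetical local data with the column names used on the slides
df <- data.frame(name = c("Smith", "Lee", "Kim"),
                 age = c(30, 19, 25),
                 col1 = c(10, 20, 30),
                 stringsAsFactors = FALSE)

# Filter rows by age and keep the first two columns
picked <- df[df$age %in% c(19, 30), 1:2]

# Equivalent call through subset(); the third argument is `select`
same <- subset(df, df$age %in% c(19, 30), 1:2)

# Add derived columns without modifying df itself
newDF2 <- transform(df, newCol = df$col1 / 5, newCol2 = df$col1 * 2)
```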


Editor's Notes

  • #6: In RStudio: .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
  • #14: Use the viewer to check out the notebook without running Zeppelin: https://www.zeppelinhub.com/viewer/
  • #15: Retail employment, in millions (2008-2014) Source: Bureau of Labor Statistics Credit: NPR
  • #23: dplyr-like syntax