Machine Learning with SparkR
OLGUN AYDIN
SENIOR DATA SCIENTIST
olgun_aydin@epam.com
info@olgunaydin.com
About me
• B.Sc. and M.Sc. degrees in Statistics
• Data Scientist with 6 years of experience
• 6 years of experience with R
• Love to use R, SparkR and Shiny
• Organizer of PyData Istanbul
• Co-organizer of Istanbul Spark Meetup
• Co-organizer of Trójmiasto Spark Meetup
github.com/olgnaydn/R
www.linkedin.com/in/olgun-aydin/
twitter.com/olgunaydinn
https://www.packtpub.com/books/info/authors/olgun-aydin
Outline
• Introduction to Machine Learning
• SparkR
• Getting Data
• DataFrames
• Applications
Introduction to Machine Learning
• Machine learning is a field of computer science that uses statistical
techniques to give computer systems the ability to "learn" (e.g.,
progressively improve performance on a specific task) with data, without
being explicitly programmed. (Wikipedia)
• Machine learning is closely related to (and often overlaps with)
computational statistics, which also focuses on prediction-making through
the use of computers. It has strong ties to mathematical optimization,
which delivers methods, theory and application domains to the field.
Introduction to Machine Learning
• DeepMind developed an agent that surpassed human-level
performance at 49 Atari games, receiving only the pixels and game
score as inputs.
• Soon after, in 2016, DeepMind surpassed its own achievement
by releasing a new state-of-the-art gameplay method called A3C.
• Meanwhile, AlphaGo defeated one of the best human players at
Go, an extraordinary achievement in a game that humans continued to
dominate for two decades after machines first conquered chess.
Examples for Real Life Applications
Internet Search
• Google, Bing, Yahoo, Ask
• Better results with data science algorithms
Recommendation Systems
• Netflix, Amazon, Alibaba
Prediction Systems
• Image recognition, speech recognition
• Fraud and risk detection, self-driving cars, robots
Power of Spark
• Fast
• Powerful
• Scalable
Power of R
• Effective
• Number of packages
• One of the most preferred languages
for statistical analysis
Power of R + Spark
• Effective
• Powerful
• Statistical power
• Fast
• Scalable
• SparkR provides a frontend to Apache Spark and uses Spark's distributed
computation engine to enable large-scale data analysis from the R shell.
• Data analysis using R is limited by the amount of memory available on a
single machine. Further, because R is single-threaded, it is often
impractical to use R on large datasets.
• With SparkR, R programs can be scaled while remaining easy to use and
deploy across a number of workloads. SparkR is an R frontend for Apache
Spark, a widely deployed cluster computing engine. There are a number of
benefits to designing an R frontend that is tightly integrated with Spark.
• SparkR requires no changes to R. The central component of SparkR is a
distributed data frame that enables structured data processing with a
syntax familiar to R users.
• To improve performance over large datasets, SparkR performs lazy
evaluation on data frame operations and uses Spark's relational query
optimizer to optimize execution.
• SparkR was initially developed at the AMPLab, UC Berkeley, and has been
part of Apache Spark since the 1.4 release.
• The central component of SparkR is a distributed data frame implemented
on top of Spark.
• SparkR DataFrames have an API similar to dplyr or local R data frames, but
scale to large datasets using Spark's execution engine and relational query
optimizer.
• SparkR's read.df method integrates with Spark's data source API, which
enables users to load data from systems such as HBase and Cassandra. Having
loaded the data, users can then use a familiar syntax for performing
relational operations like selections, projections, aggregations and joins.
• Further, SparkR supports more than 100 pre-defined functions on
DataFrames, including string manipulation methods, statistical functions
and date-time operations. Users can also execute SQL queries directly on
SparkR DataFrames using the sql command. SparkR also makes it easy for
users to chain commands using existing R libraries.
• Finally, SparkR DataFrames can be converted to a local R data frame using
the collect operator. This is useful for "big data, small learning"
scenarios, where a large dataset is first reduced on the cluster to a
result small enough to analyze in local R.
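The collect workflow can be sketched as follows. This is a minimal illustration, not code from the original slides: it assumes Spark 2.0 or later (where sparkR.session() is the entry point), a local Spark installation that SparkR can find, and the faithful dataset as stand-in data. The threshold of 80 minutes is arbitrary.

```r
library(SparkR)

# Assumption: Spark 2.0+ running in local mode.
sparkR.session(master = "local[*]", appName = "collect-sketch")

# Distributed copy of a small built-in dataset (stand-in for a large one).
df <- createDataFrame(faithful)

# Reduce the data on the Spark side first (lazy until an action runs).
long_waits <- filter(df, df$waiting > 80)

# collect() brings the reduced result back as an ordinary R data.frame.
local_df <- collect(long_waits)

# From here on, any local R modelling function can be used.
fit <- lm(eruptions ~ waiting, data = local_df)
print(coef(fit))
```

Only the filtered rows cross from Spark into the R process, so the local step stays within single-machine memory.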
• SparkR's architecture consists of two main components: an R-to-JVM
binding on the driver that allows R programs to submit jobs to a Spark
cluster, and support for running R on the Spark executors.
Installation and Creating a SparkContext
• Step 1: Download Spark
http://spark.apache.org/
• Step 2: Run in Command Prompt
Start your favorite command shell and change directory to your Spark folder.
• Step 3: Run in RStudio
Once you have opened RStudio, set the system environment first so that your
R session points to the installed version of SparkR. Use the code shown in
Figure 11, but replace the value of the SPARK_HOME variable with the path to
your Spark folder, for example "C:/Apache/Spark-1.4.1".
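A minimal sketch of what Step 3 could look like in an RStudio session. The path is the example path from the slide and must be replaced with your own Spark folder. The branch on sparkR.session is an assumption added here so the same snippet works on both Spark 1.4 to 1.6 (sparkR.init and sparkRSQL.init) and Spark 2.0 or later (sparkR.session).

```r
# Example path from the slide; replace with the path to your Spark folder.
Sys.setenv(SPARK_HOME = "C:/Apache/Spark-1.4.1")

# Make the SparkR package bundled with Spark visible to this R session.
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)

if (exists("sparkR.session", mode = "function")) {
  # Spark 2.0+: a single session object replaces SparkContext and SQLContext.
  sparkR.session(master = "local[*]", appName = "sparkr-setup")
} else {
  # Spark 1.4 to 1.6: create the SparkContext, then the SQLContext.
  sc <- sparkR.init(master = "local")
  sqlContext <- sparkRSQL.init(sc)
}
```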
Getting Data
• From local data frames
• The simplest way to create a DataFrame is to convert a local R data frame
into a SparkR DataFrame. Specifically, we can use createDataFrame and
pass in the local R data frame. For example, the faithful dataset that
ships with R can be converted into a SparkR DataFrame this way.
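A short sketch of the conversion, assuming Spark 2.0 or later, where createDataFrame no longer takes a SQLContext argument (on Spark 1.4 to 1.6 the call would be createDataFrame(sqlContext, faithful)).

```r
library(SparkR)

# Assumption: Spark 2.0+ in local mode.
sparkR.session(master = "local[*]", appName = "local-df-sketch")

# Convert the built-in local data frame into a distributed SparkR DataFrame.
df <- createDataFrame(faithful)

# Inspect the schema and the first rows; head() returns a local data.frame.
printSchema(df)
first_rows <- head(df)
print(first_rows)
```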
Getting Data
• From Data Sources
• SparkR supports operating on a variety of data sources through the DataFrame
interface. This section describes the general methods for loading and saving
data using Data Sources. You can check the Spark SQL programming guide for
more specific options that are available for the built-in data sources.
• The general method for creating DataFrames from data sources is read.df.
• This method takes in the SQLContext, the path of the file to load and the type
of data source.
• SparkR supports reading JSON and Parquet files natively, and through Spark
Packages you can find data source connectors for popular file formats like CSV
and Avro.
Getting Data
• Data sources can be illustrated with a JSON input file. Note that the
file expected here is not a typical JSON file: each line in the file must
contain a separate, self-contained valid JSON object.
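The line-per-object format can be sketched like this. The three records are hypothetical sample data written to a temporary file so the example is self-contained. The call assumes Spark 2.0 or later, where read.df takes the path first and no SQLContext.

```r
library(SparkR)

# Assumption: Spark 2.0+ in local mode.
sparkR.session(master = "local[*]", appName = "json-sketch")

# Hypothetical sample data: one self-contained JSON object per line.
json_path <- tempfile(fileext = ".json")
writeLines(c('{"name":"Michael"}',
             '{"name":"Andy", "age":30}',
             '{"name":"Justin", "age":19}'),
           json_path)

# Load the file through the generic data source API.
people <- read.df(json_path, source = "json")

# The schema is inferred from the JSON objects.
printSchema(people)
print(head(people))
```

A file holding one large JSON array spread over several lines would not be read correctly this way, which is why the one-object-per-line layout matters.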
Getting Data
• From Hive tables
• You can also create SparkR DataFrames from Hive tables. To do this, we need to create
a HiveContext, which can access tables in the Hive MetaStore. Note that Spark must have
been built with Hive support. More details on the difference between SQLContext and
HiveContext can be found in the SQL programming guide.
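A hedged sketch of querying a Hive table. It assumes Spark 2.0 or later built with Hive support, where sparkR.session(enableHiveSupport = TRUE) takes the place of the HiveContext (on Spark 1.4 to 1.6 this was sparkRHive.init(sc)). The table name src and its columns are illustrative only.

```r
library(SparkR)

# Assumption: Spark 2.0+ built with Hive support, running in local mode.
sparkR.session(master = "local[*]", appName = "hive-sketch",
               enableHiveSupport = TRUE)

# Illustrative table; created only if it does not already exist.
sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")

# Queries are written in HiveQL and return a SparkR DataFrame.
results <- sql("FROM src SELECT key, value")
print(head(results))
```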
SQL queries in SparkR
• A SparkR DataFrame can also be registered as a temporary table in Spark SQL, which
allows you to run SQL queries over its data. The sql function enables applications to
run SQL queries programmatically and returns the result as a DataFrame.
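A minimal sketch of registering and querying a DataFrame, assuming Spark 2.0 or later, where createOrReplaceTempView replaces the older registerTempTable. The view name faithful_tbl and the filter condition are illustrative choices.

```r
library(SparkR)

# Assumption: Spark 2.0+ in local mode.
sparkR.session(master = "local[*]", appName = "sql-sketch")

df <- createDataFrame(faithful)

# Register the DataFrame under a name that SQL queries can refer to.
createOrReplaceTempView(df, "faithful_tbl")

# sql() returns the query result as a SparkR DataFrame.
long_waits <- sql("SELECT eruptions, waiting FROM faithful_tbl WHERE waiting > 80")
print(head(long_waits))
```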
DataFrames
• SparkR DataFrames support a number of functions for structured data processing.
Basic examples include selecting columns and filtering rows; the complete list can be
found in the API docs.
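Selecting columns and filtering rows can be sketched as follows, assuming Spark 2.0 or later and the faithful dataset as example data. The cutoff of 50 minutes is arbitrary.

```r
library(SparkR)

# Assumption: Spark 2.0+ in local mode.
sparkR.session(master = "local[*]", appName = "dataframe-ops-sketch")

df <- createDataFrame(faithful)

# Projection: keep only the eruptions column.
eruptions_only <- select(df, df$eruptions)
print(head(eruptions_only))

# Selection: keep only rows with a waiting time shorter than 50 minutes.
short_waits <- filter(df, df$waiting < 50)
print(head(short_waits))
```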
DataFrames
• SparkR DataFrames support a number of commonly used functions to aggregate data
after grouping. For example, we can compute a histogram of the waiting time in the
faithful dataset by grouping on the waiting column and counting the rows in each group.
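The grouping and counting step can be sketched like this, assuming Spark 2.0 or later. The variable name waiting_counts is chosen here for illustration.

```r
library(SparkR)

# Assumption: Spark 2.0+ in local mode.
sparkR.session(master = "local[*]", appName = "groupby-sketch")

df <- createDataFrame(faithful)

# Group rows by waiting time and count how many rows fall into each group.
waiting_counts <- summarize(groupBy(df, df$waiting), count = n(df$waiting))

# Sort so the most common waiting times come first.
most_common <- arrange(waiting_counts, desc(waiting_counts$count))
print(head(most_common))
```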
DataFrames
• SparkR also provides a number of functions that can be directly applied to columns for
data processing and during aggregation. For example, basic arithmetic can be applied to
a column to derive a new one.
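A small sketch of column arithmetic, assuming Spark 2.0 or later. The new column name waiting_secs is illustrative; it converts the waiting time from minutes to seconds.

```r
library(SparkR)

# Assumption: Spark 2.0+ in local mode.
sparkR.session(master = "local[*]", appName = "column-ops-sketch")

df <- createDataFrame(faithful)

# Arithmetic on a Column yields a new Column, which can be assigned
# back into the same DataFrame under a new name.
df$waiting_secs <- df$waiting * 60
print(head(df))
```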
Applications
Correlation Analysis
K-Means
Decision Trees
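The three applications can be sketched on the faithful dataset. This is an illustration under stated assumptions, not the analysis from the original talk: spark.kmeans requires Spark 2.0 or later and spark.decisionTree requires Spark 2.3 or later. The choices k = 2 and maxDepth = 3 are arbitrary.

```r
library(SparkR)

# Assumption: Spark 2.3+ in local mode.
sparkR.session(master = "local[*]", appName = "applications-sketch")

df <- createDataFrame(faithful)

# Correlation analysis: Pearson correlation between two numeric columns.
r <- corr(df, "eruptions", "waiting")
print(r)

# K-Means: cluster the observations into two groups using both columns.
km_model <- spark.kmeans(df, ~ eruptions + waiting, k = 2)
clusters <- predict(km_model, df)
print(head(clusters))

# Decision tree: regress eruption duration on waiting time.
dt_model <- spark.decisionTree(df, eruptions ~ waiting,
                               type = "regression", maxDepth = 3)
dt_pred <- predict(dt_model, df)
print(head(select(dt_pred, "eruptions", "prediction")))
```

Both predict calls return SparkR DataFrames with an added prediction column, so the results stay distributed until they are collected.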
Machine Learning with SparkR

  • 2. About me  BsC. And MsC. degree from Statistics  6 years experienced Data Scientist  6 years experience of R  Love to use R, SparkR and Shiny  Organizer of PyData Istanbul  Co-organizer of Istanbul Spark Meetup  Co-organizer of Trojimasto Spark Meetup github.com/olgnaydn/R www.linkedin.com/in/olgun-aydin/ twitter.com/olgunaydinn https://www.packtpub.com/books/info/authors/olgun-aydin
  • 3. Outline  Introduction to Machine Learning  SparkR  Getting Data  DataFrames  Applications
  • 4. Introduction to Machine Learning  Machine learning is a field of computer science that uses statistical techniques to give computer systems the ability to "learn" (e.g., progressively improve performance on a specific task) with data, without being explicitly programmed. (Wikipedia)  Machine learning is closely related to (and often overlaps with) computational statistics, which also focuses on prediction-making through the use of computers. It has strong ties to mathematical optimization, which delivers methods, theory and application domains to the field.
  • 5. Introduction to Machine Learning  DeepMind developed an agent that surpassed human-level performance at 49 Atari games, receiving only the pixels and game score as inputs.  Soon after, in 2016, DeepMind obsoleted their own this achievement by releasing a new state-of-the-art gameplay method called A3C.  Meanwhile, AlphaGo defeated one of the best human players at Go—an extraordinary achievement in a game dominated by humans for two decades after machines first conquered chess.
  • 8. 2 1 3 Examples for Real Life Applications Internet Search • Google, Bing, Yahoo, Ask • Better results with data science algorithms Recomendation Systems • Netflix, Amazon, Alibaba Predictions Systems • Image recognition, speech recognition • Fraud and Risk detection,Self driving cars, robots
  • 9. Examples for Real Life Applications
  • 10. Power of  Fast  Powerful  Scalable
  • 11. Power of  Effective  Number of Packages  One of the Most prefered language for statistical analysis
  • 14.  Effective  Powerful  Statiscal Power  Fast  Scalable +
  • 15.  SparkR provides a frontend to Apache Spark and uses Spark’s distributed computation engine to enable large-scale data analysis from the R shell.
  • 16.  Data analysis in R is limited by the amount of memory available on a single machine, and because R is single-threaded it is often impractical to use R on large datasets.  SparkR is an R frontend for Apache Spark, a widely deployed cluster computing engine. It lets R programs scale while remaining easy to use and deploy across a number of workloads. There are a number of benefits to designing an R frontend that is tightly integrated with Spark.
  • 17.  SparkR requires no changes to R. The central component of SparkR is a distributed data frame that enables structured data processing with a syntax familiar to R users.  To improve performance over large datasets, SparkR performs lazy evaluation on data frame operations and uses Spark’s relational query optimizer to optimize execution.  SparkR was initially developed at the AMPLab, UC Berkeley, and is now part of the Apache Spark project.
  • 18.  The central component of SparkR is a distributed data frame implemented on top of Spark.  SparkR DataFrames have an API similar to dplyr or local R data frames, but scale to large datasets using Spark’s execution engine and relational query optimizer.  SparkR’s read.df method integrates with Spark’s data source API, which enables users to load data from systems like HBase, Cassandra, etc. Having loaded the data, users can then use a familiar syntax for relational operations like selections, projections, aggregations and joins.
  • 19.  Further, SparkR supports more than 100 pre-defined functions on DataFrames, including string manipulation methods, statistical functions and date-time operations. Users can also execute SQL queries directly on SparkR DataFrames using the sql command. SparkR also makes it easy for users to chain commands using existing R libraries.  Finally, SparkR DataFrames can be converted to a local R data frame using the collect operator, which is useful for "big data, small learning" scenarios, where a large dataset is reduced in Spark to a result small enough to analyze locally in R.
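The "collect, then analyze locally" pattern can be sketched as follows. This is a minimal illustration, not taken from the slides: it assumes the SparkR package is on the library path, uses the SparkR 1.4-style contexts the slides refer to (later releases replace them with sparkR.session()), and the threshold of 70 minutes is an arbitrary choice.

```r
library(SparkR)

# Assumption: a local Spark installation; SparkR 1.4-style initialization.
sc <- sparkR.init(master = "local")
sqlContext <- sparkRSQL.init(sc)

# Distributed DataFrame built from R's built-in faithful dataset.
df <- createDataFrame(sqlContext, faithful)

# Reduce the data in Spark, then pull the (small) result into local R.
long_waits <- collect(filter(df, df$waiting > 70))

# From here on, ordinary R tools apply to the local data frame.
fit <- lm(eruptions ~ waiting, data = long_waits)
print(coef(fit))
```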
  • 21.  SparkR’s architecture consists of two main components: an R-to-JVM binding on the driver that allows R programs to submit jobs to a Spark cluster, and support for running R on the Spark executors.
  • 22. Installation and Creating a SparkContext  Step 1: Download Spark  http://spark.apache.org/
  • 23. Installation and Creating a SparkContext  Step 1: Download Spark http://spark.apache.org/  Step 2: Run in Command Prompt. Start your favorite command shell and change directory to your Spark folder.  Step 3: Run in RStudio. Once you have opened RStudio, set the system environment first: you have to point your R session to the installed version of SparkR. Use the code shown in Figure 11 below, but replace the value of the SPARK_HOME variable with the path to your Spark folder, for example "C:/Apache/Spark-1.4.1".
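Step 3 might look roughly like the sketch below. It is an illustration under assumptions rather than the figure from the slides: the path is the example path from the slide and must be replaced with your own Spark folder, and the initialization calls are the SparkR 1.4-style ones (newer Spark releases use sparkR.session() instead).

```r
# Assumption: Spark is unpacked here; replace with the path to your Spark folder.
Sys.setenv(SPARK_HOME = "C:/Apache/Spark-1.4.1")

# Make the SparkR package bundled with Spark visible to this R session.
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
library(SparkR)

# Create a local SparkContext, then the SQLContext used to build DataFrames.
sc <- sparkR.init(master = "local")
sqlContext <- sparkRSQL.init(sc)
```

When the session is no longer needed, sparkR.stop() shuts the context down.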
  • 24. Getting Data  From local data frames  The simplest way to create a data frame is to convert a local R data frame into a SparkR DataFrame. Specifically, we can use createDataFrame and pass in the local R data frame to create a SparkR DataFrame. As an example, the following creates a DataFrame from the faithful dataset in R.
  • 25. Getting Data  From Data Sources  SparkR supports operating on a variety of data sources through the DataFrame interface. This section describes the general methods for loading and saving data using Data Sources. You can check the Spark SQL programming guide for more specific options that are available for the built-in data sources.  The general method for creating DataFrames from data sources is read.df.  This method takes in the SQLContext, the path of the file to load and the type of data source.  SparkR supports reading JSON and Parquet files natively, and through Spark Packages you can find data source connectors for popular file formats like CSV and Avro.
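A minimal sketch of this conversion, assuming SparkR is on the library path and using the SparkR 1.4-style API that takes an explicit sqlContext (in later releases createDataFrame is called without it):

```r
library(SparkR)

# Assumption: a local Spark installation; SparkR 1.4-style initialization.
sc <- sparkR.init(master = "local")
sqlContext <- sparkRSQL.init(sc)

# faithful ships with R and has two columns: eruptions and waiting.
df <- createDataFrame(sqlContext, faithful)

# Inspect the schema and the first rows of the distributed DataFrame.
printSchema(df)
head(df)
```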
  • 26. Getting Data  We can see how to use data sources using an example JSON input file. Note that the file that is used here is not a typical JSON file. Each line in the file must contain a separate, self-contained valid JSON object.
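The JSON case can be sketched as below. To stay self-contained, the example writes its own small line-delimited JSON file to a temporary location; the file contents and paths are illustrative assumptions, and the calls follow the SparkR 1.4-style signatures read.df(sqlContext, path, source) and write.df(df, path, source, mode).

```r
library(SparkR)

# Assumption: a local Spark installation; SparkR 1.4-style initialization.
sc <- sparkR.init(master = "local")
sqlContext <- sparkRSQL.init(sc)

# One self-contained JSON object per line (not a single JSON array).
json_path <- tempfile(fileext = ".json")
writeLines(c('{"name":"Michael"}',
             '{"name":"Andy", "age":30}',
             '{"name":"Justin", "age":19}'),
           json_path)

# Load the file through the data source API; the schema is inferred.
people <- read.df(sqlContext, json_path, "json")
printSchema(people)
head(people)

# Save the same data in Parquet format to another temporary path.
parquet_path <- tempfile(fileext = ".parquet")
write.df(people, path = parquet_path, source = "parquet", mode = "overwrite")
```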
  • 27. Getting Data  From Hive tables  You can also create SparkR DataFrames from Hive tables. To do this, we need to create a HiveContext, which can access tables in the Hive MetaStore. Note that Spark must have been built with Hive support. More details on the difference between SQLContext and HiveContext can be found in the SQL programming guide.
  • 28. SQL queries in SparkR  A SparkR DataFrame can also be registered as a temporary table in Spark SQL, which allows you to run SQL queries over its data. The sql function enables applications to run SQL queries programmatically and returns the result as a DataFrame.
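A rough sketch of the Hive path, under the assumptions that Spark was built with Hive support and that the SparkR 1.4-style sparkRHive.init is available (later releases enable Hive through sparkR.session() instead). The table name src and its columns are hypothetical.

```r
library(SparkR)

# Assumption: a local Spark build with Hive support.
sc <- sparkR.init(master = "local")

# HiveContext gives access to tables in the Hive MetaStore.
hiveContext <- sparkRHive.init(sc)

# Hypothetical table used only for illustration.
sql(hiveContext, "CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")

# Queries are written in HiveQL and return a SparkR DataFrame.
results <- sql(hiveContext, "FROM src SELECT key, value")
printSchema(results)
```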
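As an illustration, the sketch below registers the faithful data as a temporary table and queries it. It assumes the SparkR 1.4-style API (registerTempTable and sql(sqlContext, ...)); the table name and the 90-minute threshold are arbitrary choices for the example.

```r
library(SparkR)

# Assumption: a local Spark installation; SparkR 1.4-style initialization.
sc <- sparkR.init(master = "local")
sqlContext <- sparkRSQL.init(sc)

df <- createDataFrame(sqlContext, faithful)

# Expose the DataFrame to Spark SQL under a table name.
registerTempTable(df, "faithful")

# The query runs in Spark and the result is again a SparkR DataFrame.
long_waits <- sql(sqlContext,
                  "SELECT eruptions, waiting FROM faithful WHERE waiting > 90")
head(long_waits)
```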
  • 29. DataFrames  SparkR DataFrames support a number of functions to do structured data processing. Here we include some basic examples; a complete list can be found in the API docs.
  • 30. DataFrames  SparkR data frames support a number of commonly used functions to aggregate data after grouping. For example, we can compute a histogram of the waiting time in the faithful dataset as shown below.
  • 31. DataFrames  SparkR also provides a number of functions that can be directly applied to columns for data processing and during aggregation. The example below shows the use of basic arithmetic functions.
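Two of the most basic operations, selecting columns and filtering rows, could look like this. The sketch assumes the SparkR 1.4-style initialization and uses the faithful dataset; the 50-minute cutoff is only an example value.

```r
library(SparkR)

# Assumption: a local Spark installation; SparkR 1.4-style initialization.
sc <- sparkR.init(master = "local")
sqlContext <- sparkRSQL.init(sc)

df <- createDataFrame(sqlContext, faithful)

# Projection: keep only the eruptions column (by Column object or by name).
eruptions_only <- select(df, df$eruptions)
head(select(df, "eruptions"))

# Selection: keep only rows with a waiting time under 50 minutes.
short_waits <- filter(df, df$waiting < 50)
head(short_waits)
```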
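The histogram amounts to counting rows per distinct waiting time. A minimal sketch, assuming the SparkR 1.4-style initialization and the groupBy, summarize, n, arrange and desc functions from SparkR:

```r
library(SparkR)

# Assumption: a local Spark installation; SparkR 1.4-style initialization.
sc <- sparkR.init(master = "local")
sqlContext <- sparkRSQL.init(sc)

df <- createDataFrame(sqlContext, faithful)

# Group by waiting time and count the rows in each group.
waiting_counts <- summarize(groupBy(df, df$waiting), count = n(df$waiting))

# Show the most common waiting times first.
head(arrange(waiting_counts, desc(waiting_counts$count)))
```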
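For instance, column arithmetic can derive a new column from an existing one. The sketch below converts the waiting time from minutes to seconds; the column name waiting_secs is a hypothetical name chosen for the example, and the initialization again assumes the SparkR 1.4-style API.

```r
library(SparkR)

# Assumption: a local Spark installation; SparkR 1.4-style initialization.
sc <- sparkR.init(master = "local")
sqlContext <- sparkRSQL.init(sc)

df <- createDataFrame(sqlContext, faithful)

# Arithmetic on a Column yields a new Column, which is assigned back to df.
df$waiting_secs <- df$waiting * 60
head(df)

# Bring the result into local R to inspect it with ordinary tools.
local_df <- collect(df)
```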