1 
Scalable Analytics 
with 
R, Hadoop and RHadoop 
Gwen Shapira, Software Engineer 
@gwenshap 
gshapira@cloudera.com
2
3
4
#include warning.h 
5
Agenda 
• R Basics 
• Hadoop Basics 
• Data Manipulation 
• RHadoop 
6
Get Started with R-Studio 
7
Basic Data Types 
• String 
• Number 
• Boolean 
• Assignment <- 
8
R can be a nice calculator 
> x <- 1 
> x * 2 
[1] 2 
> y <- x + 3 
> y 
[1] 4 
> log(y) 
[1] 1.386294 
> help(log) 
9
Complex Data Types 
• Vector 
• c, seq, rep, [] 
• List 
• Data Frame 
• A list of vectors of the same length 
• Not a matrix 
10
Creating vectors 
> v1 <- c(1,2,3,4) 
> v1 
[1] 1 2 3 4 
> v1 * 4 
[1] 4 8 12 16 
> v4 <- c(1:5) 
> v4 
[1] 1 2 3 4 5 
> v2 <- seq(2,12,by=3) 
> v2 
[1] 2 5 8 11 
> v1 * v2 
[1] 2 10 24 44 
> v3 <- rep(3,4) 
> v3 
[1] 3 3 3 3 
11
Accessing and filtering vectors 
> v1 <- c(2,4,6,8) 
> v1 
[1] 2 4 6 8 
> v1[2] 
[1] 4 
> v1[2:4] 
[1] 4 6 8 
> v1[-2] 
[1] 2 6 8 
> v1[v1>3] 
[1] 4 6 8 
12
Lists 
> lst <- list(1,"x",FALSE) 
> lst 
[[1]] 
[1] 1 
[[2]] 
[1] "x" 
[[3]] 
[1] FALSE 
> lst[1] 
[[1]] 
[1] 1 
> lst[[1]] 
[1] 1 
13
Data Frames 
books <- read.csv("~/books.csv") 
books[1,] 
books[,1] 
books[3:4] 
books$price 
books[books$price==6.99,] 
martin_price <- books[books$author_t=="George R.R. Martin",]$price 
mean(martin_price) 
subset(books,select=-c(id,cat,sequence_i)) 
14
15
Functions 
> sq <- function(x) { x*x } 
> sq(3) 
[1] 9 
16 
Note: 
R is a functional programming language. 
Functions are first-class objects 
and can be passed to other functions.
Packages 
17
Agenda 
• R Basics 
• Hadoop Basics 
• Data Manipulation 
• RHadoop 
18
“In pioneer days they used oxen for heavy 
pulling, and when one ox couldn’t budge a log, 
they didn’t try to grow a larger ox” 
- Grace Hopper, early advocate of distributed computing
20 
Hadoop in a Nutshell
Map-Reduce is the interesting bit 
• Map – Apply a function to each input record 
• Shuffle & Sort – Partition the map output and sort 
each partition 
• Reduce – Apply an aggregation function to the values 
of each key in each partition 
• Map reads input from disk 
• Reduce writes output to disk 
21
Example – Sessionize clickstream 
22
Sessionize 
Identify unique “sessions” of interacting with our 
website 
Session – for each user (IP), set of clicks that happened 
within 30 minutes of each other 
23
Input – Apache Access Log Records 
127.0.0.1 - frank 
[10/Oct/2000:13:55:36 -0700] 
"GET /apache_pb.gif HTTP/1.0" 
200 2326 
24
Output – Add Session ID 
127.0.0.1 - frank 
[10/Oct/2000:13:55:36 -0700] 
"GET /apache_pb.gif HTTP/1.0" 
200 2326 15 
25
Overview 
26 
[Diagram: log lines feed the Map tasks; each Reduce task receives (IP1, log lines) and emits (log line, session ID)]
Map 
parsedRecord = re.search('(\d+\.\d+….', record) 
IP = parsedRecord.group(1) 
timestamp = parsedRecord.group(2) 
print((IP, timestamp), record) 
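A runnable version of this mapper might look like the sketch below. The slide elides its regular expression, so `LOG_PATTERN` is an assumption written for the Apache common log format shown earlier, and `sessionize_map` is a hypothetical name. The function returns the pair instead of printing it so it can be checked in isolation.

```python
import re
from datetime import datetime

# Assumed pattern: client IP, two skipped fields, then the bracketed timestamp.
# The pattern on the original slide is elided, so this is only one possibility.
LOG_PATTERN = re.compile(r'^(\d+\.\d+\.\d+\.\d+) \S+ \S+ \[([^\]]+)\]')

def sessionize_map(record):
    """Return ((ip, timestamp), record), or None if the line does not parse."""
    match = LOG_PATTERN.match(record)
    if match is None:
        return None
    ip = match.group(1)
    # Parse e.g. "10/Oct/2000:13:55:36 -0700" into a timezone-aware datetime.
    timestamp = datetime.strptime(match.group(2), "%d/%b/%Y:%H:%M:%S %z")
    return (ip, timestamp), record

line = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
        '"GET /apache_pb.gif HTTP/1.0" 200 2326')
key, value = sessionize_map(line)
print(key[0], key[1].isoformat())
```

Emitting the composite key (IP, timestamp) is what lets the framework partition by IP and sort by timestamp in the next phase.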
27
Shuffle & Sort 
Partition by: IP 
Sort by: timestamp 
Now reduce gets: 
(IP,timestamp) [record1,record2,record3….] 
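The effect of partitioning by IP and sorting by timestamp can be imitated in memory as follows. This is a sketch of the behaviour only; in Hadoop it is configured through a partitioner and sort comparator rather than written by hand. `shuffle_and_sort` is a hypothetical name, and the timestamps are plain integers (minutes) to keep the example short.

```python
from collections import defaultdict

def shuffle_and_sort(mapped):
    """Group ((ip, timestamp), record) pairs by IP, ordered by timestamp."""
    partitions = defaultdict(list)
    for (ip, timestamp), record in mapped:
        partitions[ip].append((timestamp, record))
    # Sort inside each partition so the reducer sees clicks in time order.
    return {
        ip: [record for _, record in sorted(pairs, key=lambda pair: pair[0])]
        for ip, pairs in partitions.items()
    }

mapped = [
    (("10.0.0.1", 50), "click C"),
    (("10.0.0.2", 5), "click X"),
    (("10.0.0.1", 10), "click A"),
    (("10.0.0.1", 20), "click B"),
]
grouped = shuffle_and_sort(mapped)
print(grouped)
```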
28
Reduce 
sessionID = 1 
curr_timestamp = getTimestamp(records[0]) 
for record in records: 
    # getTimestamp is assumed to return minutes; a gap of more than 30 starts a new session 
    if getTimestamp(record) - curr_timestamp > 30: 
        sessionID += 1 
    curr_timestamp = getTimestamp(record) 
    print(record + " " + str(sessionID)) 
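The reducer logic can be tried out on a handful of clicks. The sketch below assumes each click arrives as a (minute, record) pair already sorted by time for one IP; `sessionize_reduce` and `SESSION_GAP_MINUTES` are hypothetical names.

```python
SESSION_GAP_MINUTES = 30

def sessionize_reduce(clicks):
    """clicks: (minute, record) pairs for one IP, sorted by time.

    Returns each record with its session ID appended.
    """
    output = []
    session_id = 1
    previous = None
    for minute, record in clicks:
        # A gap of more than 30 minutes since the previous click starts a new session.
        if previous is not None and minute - previous > SESSION_GAP_MINUTES:
            session_id += 1
        previous = minute
        output.append(f"{record} {session_id}")
    return output

clicks = [(0, "GET /a"), (10, "GET /b"), (45, "GET /c"), (60, "GET /d")]
sessions = sessionize_reduce(clicks)
print(sessions)
```

The comparison is always against the previous click, not the first click of the session, which matches the "within 30 minutes of each other" definition.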
29
Agenda 
• R Basics 
• Hadoop Basics 
• Data Manipulation Libraries 
• RHadoop 
30
Reshape2 
• Two functions: 
• Melt – wide format to long format 
• Cast – long format to wide format 
• Columns: identifiers or measured variables 
• Molten data: 
• Unique identifiers 
• New column – variable name 
• New column – value 
• Default – all numeric columns are treated as values 
31
Melt 
> tips 
total_bill tip sex smoker day time size 
16.99 1.01 Female No Sun Dinner 2 
10.34 1.66 Male No Sun Dinner 3 
21.01 3.50 Male No Sun Dinner 3 
> melt(tips) 
sex smoker day time variable value 
Female No Sun Dinner total_bill 16.99 
Female No Sun Dinner tip 1.01 
Female No Sun Dinner size 2 
32
Cast 
> m_tips <- melt(tips) 
sex smoker day time variable value 
Female No Sun Dinner total_bill 16.99 
Female No Sun Dinner tip 1.01 
Female No Sun Dinner size 2 
> dcast(m_tips,sex+time~variable,mean) 
sex time total_bill tip size 
Female Dinner 19.21308 3.002115 2.461538 
Female Lunch 16.33914 2.582857 2.457143 
Male Dinner 21.46145 3.144839 2.701613 
Male Lunch 18.04848 2.882121 2.363636 
33
*Apply 
• apply – apply function on rows or columns of matrix 
• lapply – apply function on each item of list 
• Returns list 
• sapply – like lapply, but returns a vector when possible 
• tapply – apply function to subsets of a vector, grouped by a factor 
34
plyr 
• Split – apply – combine 
• ddply – data frame to data frame 
ddply(.data, .variables, .fun = NULL, ...) 
• Summarize – aggregate data into new data frame 
• Transform – modify data frame 
35
DDPLY Example 
> ddply(tips,c("sex","time"),summarize, 
+ mean=mean(tip), 
+ sd=sd(tip), 
+ ratio=mean(tip/total_bill) 
+ ) 
sex time mean sd ratio 
1 Female Dinner 3.002115 1.193483 0.1693216 
2 Female Lunch 2.582857 1.075108 0.1622849 
3 Male Dinner 3.144839 1.529116 0.1554065 
4 Male Lunch 2.882121 1.329017 0.1660826 
36
Agenda 
• R Basics 
• Hadoop Basics 
• Data Manipulation Libraries 
• RHadoop 
37
RHadoop Projects 
• RMR 
• RHDFS 
• RHBase 
• (new) PlyRMR 
38
Most Important: 
RMR does not parallelize algorithms. 
It allows you to implement MapReduce in R. 
Efficiently. That’s it. 
39
What does that mean? 
• Use RMR if you can break your problem down to 
small pieces and apply the algorithm there 
• Use commercial R+Hadoop if you need a parallel 
version of a well-known algorithm 
• Good fit: Fit piecewise regression model for each 
county in the US 
• Bad fit: Fit piecewise regression model for the entire 
US population 
• Bad fit: Logistic regression 
40
Use-case examples – Good or Bad? 
1. Model power consumption per household to 
determine if incentive programs work 
2. Aggregate corn yield per 10x10 portion of field to 
determine best seeds to use 
3. Create churn models for service subscribers and 
determine who is most likely to cancel 
4. Determine correlation between device restarts and 
support calls 
41
Second Most Important: 
RMR requires R, RMR and all the libraries you’ll 
use to be installed on all nodes and 
accessible by the Hadoop user 
42
RMR is different from Hadoop Streaming. 
RMR mapper input: 
Key, [List of Records] 
This is so we can use vector operations 
43
How to RMRify a Problem 
44
In more detail… 
• Mappers get list of values 
• You need to process each one independently 
• But do it for all lines at once. 
• Reducers work normally 
45
Demo 6 
> library(rmr2) 
t <- list("hello world","don't worry be happy") 
unlist(sapply(t,function (x) {strsplit(x," ")})) 
wc.map <- function(k,v) { 
  ret_k <- unlist(sapply(v,function(x){strsplit(x," ")})) 
  keyval(ret_k,1) 
} 
wc.reduce <- function(k,v) { 
  keyval(k,sum(v))} 
mapreduce(input="~/hadoop-recipes/data/shakespeare/Shakespeare_2.txt", 
  output="~/wc.json",input.format="text",output.format="json", 
  map=wc.map,reduce=wc.reduce) 
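To see the "mapper gets a list of values" idea outside of R, here is a plain-Python imitation of the same word count. It does not use rmr2 or Hadoop; `wc_map`, `wc_reduce` and `word_count` are hypothetical names, and the in-memory grouping stands in for the shuffle.

```python
from collections import defaultdict

def wc_map(key, lines):
    """Receive a whole batch of lines and emit one (word, 1) pair per word."""
    words = [word for line in lines for word in line.split(" ")]
    return [(word, 1) for word in words]

def wc_reduce(word, counts):
    """Sum the ones emitted for a single word."""
    return word, sum(counts)

def word_count(lines):
    # Stand-in for the shuffle: group the mapper output by word.
    groups = defaultdict(list)
    for word, one in wc_map(None, lines):
        groups[word].append(one)
    return dict(wc_reduce(word, counts) for word, counts in groups.items())

counts = word_count(["hello world", "don't worry be happy", "be happy"])
print(counts)
```

The mapper handles the whole batch in one call, which is the Python analogue of applying a vector operation to all lines at once.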
46
Cheating in MapReduce: 
Do everything possible to have 
map only jobs 
47
Avg Tips per Person – Naïve Input 
Gwen 1 
Jeff 2 
Leon 1 
Gwen 2.5 
Leon 3 
Jeff 1 
Gwen 1 
Gwen 2 
Jeff 1.5 
48
Avg Tips per Person - Naive 
avg.map <- function(k,v){keyval(v$V1,v$V2)} 
avg.reduce <- function(k,v) {keyval(k,mean(v))} 
mapreduce(input="~/hadoop-recipes/data/tip1.txt", 
output="~/avg.txt", 
input.format=make.input.format("csv"), 
output.format="text", 
map=avg.map,reduce=avg.reduce); 
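The naive job can be checked by hand with a small Python simulation over the input from the previous slide. This is a sketch rather than rmr2 code; `avg_map`, `avg_reduce` and `naive_average` are hypothetical names.

```python
from collections import defaultdict
from statistics import mean

# One (person, tip) row per line, as in the naive input.
rows = [("Gwen", 1), ("Jeff", 2), ("Leon", 1), ("Gwen", 2.5), ("Leon", 3),
        ("Jeff", 1), ("Gwen", 1), ("Gwen", 2), ("Jeff", 1.5)]

def avg_map(row):
    # Key each tip by the person's name.
    name, tip = row
    return name, tip

def avg_reduce(name, tips):
    # Average all tips collected for one person.
    return name, mean(tips)

def naive_average(rows):
    groups = defaultdict(list)  # stand-in for shuffle & sort
    for name, tip in map(avg_map, rows):
        groups[name].append(tip)
    return dict(avg_reduce(name, tips) for name, tips in groups.items())

averages = naive_average(rows)
print(averages)
```

Every row has to travel through the shuffle before any average can be computed, which is the cost the next slides avoid.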
49
Avg Tips per Person – Awesome Input 
Gwen 1,2.5,1,2 
Jeff 2,1,1.5 
Leon 1,3 
50
Avg Tips per Person - Optimized 
avg2.map <- function(k,v) { 
  v1 <- (sapply(v$V2,function(x){strsplit(as.character(x)," ")})) 
  keyval(v$V1,sapply(v1,function(x){mean(as.numeric(x))})) 
} 
mapreduce(input="~/hadoop-recipes/data/tip2.txt", 
  output="~/avg2.txt", 
  input.format=make.input.format("csv",sep=","), 
  output.format="text",map=avg2.map) 
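With one line per person, the average needs no shuffle and no reducer. The Python sketch below shows the map-only idea on the "awesome input" as printed on the slide (name, a space, then comma-separated tips); `avg2_map` is a hypothetical name and the parsing is an assumption about that layout, not a translation of the R code.

```python
from statistics import mean

def avg2_map(line):
    """Parse 'name tip1,tip2,...' and return (name, average tip)."""
    name, tips = line.split(" ", 1)
    return name, mean([float(tip) for tip in tips.split(",")])

lines = ["Gwen 1,2.5,1,2", "Jeff 2,1,1.5", "Leon 1,3"]

# Map-only job: each line is processed independently, with no reduce phase.
averages = dict(avg2_map(line) for line in lines)
print(averages)
```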
51
A Few Final RMR Tips 
• Backend = "local" has files as input and output 
• Backend = "hadoop" uses HDFS directories 
• In "hadoop" mode, print(X) inside the mapper will fail 
the job. 
• Use: cat("ERROR!", file = stderr()) 
52
Recommended Reading 
• http://cran.r-project.org/doc/manuals/R-intro.html 
• http://blog.revolutionanalytics.com/2013/02/10-r-packages-every-data-scientist-should-know-about.html 
• http://had.co.nz/reshape/paper-dsc2005.pdf 
• http://seananderson.ca/2013/12/01/plyr.html 
• https://github.com/RevolutionAnalytics/rmr2/blob/master/docs/tutorial.md 
• http://cran.r-project.org/web/packages/data.table/index.html 
53
54
Pixellion
 
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
CTS EXCEPTIONSPrediction of Aluminium wire rod physical properties through AI...
ThanushsaranS
 
Flip flop presenation-Presented By Mubahir khan.pptx
Flip flop presenation-Presented By Mubahir khan.pptxFlip flop presenation-Presented By Mubahir khan.pptx
Flip flop presenation-Presented By Mubahir khan.pptx
mubashirkhan45461
 
183409-christina-rossetti.pdfdsfsdasggsag
183409-christina-rossetti.pdfdsfsdasggsag183409-christina-rossetti.pdfdsfsdasggsag
183409-christina-rossetti.pdfdsfsdasggsag
fardin123rahman07
 
Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...
Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...
Adobe Analytics NOAM Central User Group April 2025 Agent AI: Uncovering the S...
gmuir1066
 
Conic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptxConic Sectionfaggavahabaayhahahahahs.pptx
Conic Sectionfaggavahabaayhahahahahs.pptx
taiwanesechetan
 
Secure_File_Storage_Hybrid_Cryptography.pptx..
Secure_File_Storage_Hybrid_Cryptography.pptx..Secure_File_Storage_Hybrid_Cryptography.pptx..
Secure_File_Storage_Hybrid_Cryptography.pptx..
yuvarajreddy2002
 
Cleaned_Lecture 6666666_Simulation_I.pdf
Cleaned_Lecture 6666666_Simulation_I.pdfCleaned_Lecture 6666666_Simulation_I.pdf
Cleaned_Lecture 6666666_Simulation_I.pdf
alcinialbob1234
 
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
Safety Innovation in Mt. Vernon A Westchester County Model for New Rochelle a...
James Francis Paradigm Asset Management
 
Template_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
Template_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnTemplate_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
Template_A3nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn
cegiver630
 
VKS-Python Basics for Beginners and advance.pptx
VKS-Python Basics for Beginners and advance.pptxVKS-Python Basics for Beginners and advance.pptx
VKS-Python Basics for Beginners and advance.pptx
Vinod Srivastava
 
C++_OOPs_DSA1_Presentation_Template.pptx
C++_OOPs_DSA1_Presentation_Template.pptxC++_OOPs_DSA1_Presentation_Template.pptx
C++_OOPs_DSA1_Presentation_Template.pptx
aquibnoor22079
 
Minions Want to eat presentacion muy linda
Minions Want to eat presentacion muy lindaMinions Want to eat presentacion muy linda
Minions Want to eat presentacion muy linda
CarlaAndradesSoler1
 
Ch3MCT24.pptx measure of central tendency
Ch3MCT24.pptx measure of central tendencyCh3MCT24.pptx measure of central tendency
Ch3MCT24.pptx measure of central tendency
ayeleasefa2
 
LLM finetuning for multiple choice google bert
LLM finetuning for multiple choice google bertLLM finetuning for multiple choice google bert
LLM finetuning for multiple choice google bert
ChadapornK
 

R for hadoopers

  • 1. Scalable Analytics with R, Hadoop and RHadoop. Gwen Shapira, Software Engineer, @gwenshap, gshapira@cloudera.com
  • 2. 2
  • 3. 3
  • 4. 4
  • 6. Agenda • R Basics • Hadoop Basics • Data Manipulation • RHadoop
  • 7. Get Started with RStudio
  • 8. Basic Data Types • String • Number • Boolean • Assignment: <-
  • 9. R can be a nice calculator
      > x <- 1
      > x * 2
      [1] 2
      > y <- x + 3
      > y
      [1] 4
      > log(y)
      [1] 1.386294
      > help(log)
  • 10. Complex Data Types • Vector (c, seq, rep, []) • List • Data Frame (a list of vectors of the same length, not a matrix)
  • 11. Creating vectors
      > v1 <- c(1,2,3,4)
      [1] 1 2 3 4
      > v1 * 4
      [1] 4 8 12 16
      > v4 <- c(1:5)
      [1] 1 2 3 4 5
      > v2 <- seq(2,12,by=3)
      [1] 2 5 8 11
      > v1 * v2
      [1] 2 10 24 44
      > v3 <- rep(3,4)
      [1] 3 3 3 3
  • 12. Accessing and filtering vectors
      > v1 <- c(2,4,6,8)
      [1] 2 4 6 8
      > v1[2]
      [1] 4
      > v1[2:4]
      [1] 4 6 8
      > v1[-2]
      [1] 2 6 8
      > v1[v1>3]
      [1] 4 6 8
  • 13. Lists
      > lst <- list (1,"x",FALSE)
      [[1]]
      [1] 1
      [[2]]
      [1] "x"
      [[3]]
      [1] FALSE
      > lst[1]
      [[1]]
      [1] 1
      > lst[[1]]
      [1] 1
  • 14. Data Frames
      books <- read.csv("~/books.csv")
      books[1,]
      books[,1]
      books[3:4]
      books$price
      books[books$price==6.99,]
      martin_price <- books[books$author_t=="George R.R. Martin",]$price
      mean(martin_price)
      subset(books,select=-c(id,cat,sequence_i))
  • 15. 15
  • 16. Functions
      > sq <- function(x) { x*x }
      > sq(3)
      [1] 9
      Note: R is a functional programming language. Functions are first-class objects and can be passed to other functions.
  • 18. Agenda • R Basics • Hadoop Basics • Data Manipulation • RHadoop
  • 19. “In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox.” (Grace Hopper, early advocate of distributed computing)
  • 20. Hadoop in a Nutshell
  • 21. Map-Reduce is the interesting bit • Map: apply a function to each input record • Shuffle & Sort: partition the map output and sort each partition • Reduce: apply an aggregation function to all values in each partition • Map reads input from disk • Reduce writes output to disk
  • 22. Example: Sessionize clickstream
  • 23. Sessionize: identify unique “sessions” of interaction with our website. A session is, for each user (IP), the set of clicks that happened within 30 minutes of each other
  • 24. Input: Apache Access Log Records
      127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
  • 25. Output: Add Session ID (here the trailing 15)
      127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 15
  • 26. Overview (diagram): log lines → Map → (IP1, log lines) → Reduce → (log line, session ID)
  • 27. Map
      parsedRecord = re.search('(\d+\.\d+….', record)  # pattern is elided on the slide
      IP = parsedRecord.group(1)
      timestamp = parsedRecord.group(2)
      print((IP, timestamp), record)
  • 28. Shuffle & Sort • Partition by: IP • Sort by: timestamp • Now reduce gets: (IP, timestamp) [record1, record2, record3, …]
  • 29. Reduce
      sessionID = 1
      curr_record = records[0]
      curr_timestamp = getTimestamp(curr_record)
      foreach record in records:
          # records arrive sorted by timestamp; getTimestamp is assumed to return minutes,
          # so a gap of more than 30 starts a new session
          if (getTimestamp(record) - curr_timestamp > 30):
              sessionID += 1
          curr_timestamp = getTimestamp(record)
          print(record + " " + str(sessionID))
  • 30. Agenda • R Basics • Hadoop Basics • Data Manipulation Libraries • RHadoop
  • 31. Reshape2 • Two functions: melt (wide format to long format) and cast (long format to wide format) • Columns are either identifiers or measured variables • Molten data has the unique identifiers, a new column with the variable name, and a new column with the value • Default: all numeric columns are treated as measured values
  • 32. Melt
      > tips
      total_bill  tip    sex smoker day   time size
           16.99 1.01 Female     No Sun Dinner    2
           10.34 1.66   Male     No Sun Dinner    3
           21.01 3.50   Male     No Sun Dinner    3
      > melt(tips)
         sex smoker day   time   variable value
      Female     No Sun Dinner total_bill 16.99
      Female     No Sun Dinner        tip  1.01
      Female     No Sun Dinner       size     2
  • 33. Cast
      > m_tips <- melt(tips)
         sex smoker day   time   variable value
      Female     No Sun Dinner total_bill 16.99
      Female     No Sun Dinner        tip  1.01
      Female     No Sun Dinner       size     2
      > dcast(m_tips,sex+time~variable,mean)
         sex   time total_bill      tip     size
      Female Dinner   19.21308 3.002115 2.461538
      Female  Lunch   16.33914 2.582857 2.457143
        Male Dinner   21.46145 3.144839 2.701613
        Male  Lunch   18.04848 2.882121 2.363636
  • 34. *Apply • apply: apply a function to the rows or columns of a matrix • lapply: apply a function to each item of a list; returns a list • sapply: like lapply, but returns a vector • tapply: apply a function to subsets of a vector or list
  • 35. plyr • Split, apply, combine • ddply: data frame to data frame: ddply(.data, .variables, .fun = NULL, ..., • summarize: aggregate data into a new data frame • transform: modify a data frame
  • 36. DDPLY Example
      > ddply(tips,c("sex","time"),summarize,
      +   mean=mean(tip),
      +   sd=sd(tip),
      +   ratio=mean(tip/total_bill)
      + )
           sex   time     mean       sd     ratio
      1 Female Dinner 3.002115 1.193483 0.1693216
      2 Female  Lunch 2.582857 1.075108 0.1622849
      3   Male Dinner 3.144839 1.529116 0.1554065
      4   Male  Lunch 2.882121 1.329017 0.1660826
  • 37. Agenda • R Basics • Hadoop Basics • Data Manipulation Libraries • RHadoop
  • 38. RHadoop Projects • RMR • RHDFS • RHBase • (new) PlyRMR
  • 39. Most Important: RMR does not parallelize algorithms. It allows you to implement MapReduce in R. Efficiently. That’s it.
  • 40. What does that mean? • Use RMR if you can break your problem down into small pieces and apply the algorithm to each piece • Use commercial R+Hadoop if you need a parallel version of a well-known algorithm • Good fit: fit a piecewise regression model for each county in the US • Bad fit: fit a piecewise regression model for the entire US population • Bad fit: logistic regression
  • 41. Use-case examples: Good or Bad? 1. Model power consumption per household to determine if incentive programs work 2. Aggregate corn yield per 10x10 portion of field to determine the best seeds to use 3. Create churn models for service subscribers and determine who is most likely to cancel 4. Determine the correlation between device restarts and support calls
  • 42. Second Most Important: RMR requires R, RMR and all libraries you’ll use to be installed on all nodes and accessible by the Hadoop user
  • 43. RMR is different from Hadoop Streaming. RMR mapper input: Key, [List of Records]. This is so we can use vector operations
  • 44. How to RMRify a Problem
  • 45. In more detail… • Mappers get a list of values • You need to process each one independently • But do it for all lines at once • Reducers work normally
  • 46. Demo 6
      > library(rmr2)
      t <- list("hello world","don't worry be happy")
      unlist(sapply(t,function (x) {strsplit(x," ")}))
      # map: split every line of the batch into words and emit (word, 1)
      wc.map <- function(k,v) {
        ret_k <- unlist(sapply(v,function(x){strsplit(x," ")}))
        keyval(ret_k,1)
      }
      # reduce: sum the counts for each word
      wc.reduce <- function(k,v) { keyval(k,sum(v)) }
      mapreduce(input="~/hadoop-recipes/data/shakespeare/Shakespeare_2.txt",
                output="~/wc.json", input.format="text", output.format="json",
                map=wc.map, reduce=wc.reduce)
  • 47. Cheating in MapReduce: do everything possible to have map-only jobs
  • 48. Avg Tips per Person: Naïve Input
      Gwen 1
      Jeff 2
      Leon 1
      Gwen 2.5
      Leon 3
      Jeff 1
      Gwen 1
      Gwen 2
      Jeff 1.5
  • 49. Avg Tips per Person: Naive
      avg.map <- function(k,v) { keyval(v$V1,v$V2) }
      avg.reduce <- function(k,v) { keyval(k,mean(v)) }
      mapreduce(input="~/hadoop-recipes/data/tip1.txt",
                output="~/avg.txt",
                input.format=make.input.format("csv"),
                output.format="text",
                map=avg.map, reduce=avg.reduce)
  • 50. Avg Tips per Person: Awesome Input
      Gwen 1,2.5,1,2
      Jeff 2,1,1.5
      Leon 1,3
  • 51. Avg Tips per Person: Optimized
      # map-only job: each input line already holds all tips for one person
      avg2.map <- function(k,v) {
        v1 <- (sapply(v$V2,function(x){strsplit(as.character(x)," ")}))
        keyval(v$V1,sapply(v1,function(x){mean(as.numeric(x))}))
      }
      mapreduce(input="~/hadoop-recipes/data/tip2.txt",
                output="~/avg2.txt",
                input.format=make.input.format("csv",sep=","),
                output.format="text", map=avg2.map)
  • 52. Few Final RMR Tips • Backend = "local" has files as input and output • Backend = "hadoop" uses HDFS directories • In "hadoop" mode, print(X) inside the mapper will fail the job • Use instead: cat("ERROR!", file = stderr())
  • 53. Recommended Reading • http://cran.r-project.org/doc/manuals/R-intro.html • http://blog.revolutionanalytics.com/2013/02/10-r-packages-every-data-scientist-should-know-about.html • http://had.co.nz/reshape/paper-dsc2005.pdf • http://seananderson.ca/2013/12/01/plyr.html • https://github.com/RevolutionAnalytics/rmr2/blob/master/docs/tutorial.md • http://cran.r-project.org/web/packages/data.table/index.html
  • 54. 54
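
The sessionize example on slides 23 to 29 can be sketched end to end in plain Python. This is a minimal single-process sketch under stated assumptions, not Hadoop code: the regular expression, the helper names (`parse`, `map_record`, `shuffle_sort`, `reduce_sessions`, `sessionize`) and the `SESSION_GAP` constant are hypothetical, and the shuffle step is simulated with an in-memory sort.

```python
import re
from datetime import datetime, timedelta
from itertools import groupby

# Assumed pattern for the Apache access-log line shown on slide 24:
# group 1 is the client IP, group 2 is the bracketed timestamp.
LOG_PATTERN = re.compile(r'^(\d+\.\d+\.\d+\.\d+) \S+ \S+ \[([^\]]+)\]')
SESSION_GAP = timedelta(minutes=30)


def parse(record):
    """Return (ip, timestamp) for one access-log line."""
    match = LOG_PATTERN.search(record)
    ip = match.group(1)
    timestamp = datetime.strptime(match.group(2), '%d/%b/%Y:%H:%M:%S %z')
    return ip, timestamp


def map_record(record):
    """Map: emit ((ip, timestamp), record) for one input line."""
    return parse(record), record


def shuffle_sort(pairs):
    """Shuffle & sort stand-in: group by IP, each group ordered by timestamp."""
    ordered = sorted(pairs, key=lambda kv: kv[0])
    for ip, group in groupby(ordered, key=lambda kv: kv[0][0]):
        yield ip, [record for _, record in group]


def reduce_sessions(records):
    """Reduce: append a session ID, starting a new session after a long gap."""
    session_id = 1
    previous = parse(records[0])[1]
    out = []
    for record in records:
        current = parse(record)[1]
        if current - previous > SESSION_GAP:
            session_id += 1
        previous = current
        out.append(f'{record} {session_id}')
    return out


def sessionize(lines):
    """Run map, shuffle & sort, and reduce over a list of log lines."""
    mapped = [map_record(line) for line in lines]
    result = []
    for _, records in shuffle_sort(mapped):
        result.extend(reduce_sessions(records))
    return result
```

In this sketch the session ID restarts at 1 for every IP, which mirrors a reducer that initializes its counter once per key. Calling `sessionize(lines)` on a list of raw log lines returns the same lines, grouped by IP and ordered by time, each with its session ID appended.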

Editor's Notes

  • #16: Modern CPUs are optimized with vector instructions, so many vector operations can be done on entire vectors in one instruction. Loops obviously take many instructions, both for the operations and for running through the loop.
  • #20: This quote is excerpted from the one at the beginning of Chapter 1 in Hadoop: The Definitive Guide by Tom White.
  • #22: Example to illustrate MapReduce
  • #40: RevolutionR and Oracle have (expensive) packages of popular algorithms, parallelized.
  • #43: Just saved you hours of debugging. You can thank me later.
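
Slides 43 to 46 describe a mapper that receives a whole batch of records at once and a reducer that sums the counts per word. The data flow can be sketched in plain Python as follows. This only illustrates the idea and is not the rmr2 API: the function names (`wc_map`, `wc_reduce`, `run_wordcount`) and the in-memory grouping that stands in for the shuffle are assumptions.

```python
from collections import defaultdict


def wc_map(key, values):
    """Map: take a batch of lines and emit one (word, 1) pair per word."""
    # split() breaks on any whitespace; the slide splits on a single space.
    words = [word for line in values for word in line.split()]
    return [(word, 1) for word in words]


def wc_reduce(key, values):
    """Reduce: sum all counts collected for one word."""
    return key, sum(values)


def run_wordcount(lines):
    """Run map, an in-memory stand-in for the shuffle, then reduce."""
    grouped = defaultdict(list)
    for word, count in wc_map(None, lines):
        grouped[word].append(count)
    return dict(wc_reduce(word, counts) for word, counts in sorted(grouped.items()))
```

The mapper handles the whole batch in one pass rather than being called once per line, which is the point slide 45 makes about processing all lines at once.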