Introduction to SparkR

Introduction to R and integration of SparkR and Spark’s MLlib
Dang Trung Kien

About me
• Statistics undergraduate
• trungkiendang@hotmail.com

What is R?
• Statistical Programming Language
• Open source
• > 6000 available packages
• Widely used in academia and research

Companies that use R
• Facebook
• Google
• Foursquare
• Ford
• Bank of America
• ANZ
• …

Data types
• Vector
• Matrix
• List
• Data frame

Vector
• c(1, 2, 3, 4)
## [1] 1 2 3 4
• 1:4
## [1] 1 2 3 4
• c("a", "b", "c")
## [1] "a" "b" "c"
• c(T, F, T)
## [1] TRUE FALSE TRUE

Matrix
• matrix(c(1, 2, 3, 4), ncol=2)
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
• matrix(c(1, 2, 3, 4), ncol=2, byrow=T)
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4

List
• list(12, "twelve")
## [[1]]
## [1] 12
##
## [[2]]
## [1] "twelve"

Data frame
name <- c("A", "B", "C")
age <- c(30, 17, 42)
male <- c(T, F, F)
data.frame(name, age, male)
##   name age  male
## 1    A  30  TRUE
## 2    B  17 FALSE
## 3    C  42 FALSE
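As a brief illustration of how these types are typically used, the sketch below indexes a vector, a data frame, and a list. The variable names `people`, `adults`, and `item` and the age threshold are chosen only for this example.

```r
# Rebuild the data frame from the slide
name <- c("A", "B", "C")
age  <- c(30, 17, 42)
male <- c(TRUE, FALSE, FALSE)
people <- data.frame(name, age, male)

# Vectors are indexed from 1; a logical condition can act as a filter
adult_ages <- age[age >= 18]

# Columns are accessed with $, rows are filtered with a logical condition
adults <- people[people$age >= 18, ]

# Lists can mix types; [[ ]] extracts a single element
item <- list(12, "twelve")
second <- item[[2]]
```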

x <- 1:100
y <- 1:100 + runif(100, 0, 20)  # linear trend plus uniform noise
m <- lm(y ~ x)                  # fit a simple linear regression
plot(y ~ x)
abline(m)                       # overlay the fitted regression line

But…
• R is single-threaded
• Can only process data sets that fit in the memory of a single machine

SparkR
• An R package that provides a light-weight front-end to use Apache Spark from R
• Exposes the RDD API of Spark as distributed lists in R
• Allows users to interactively run jobs from the R shell on a cluster

Spark + R
RDD API functions available from R (from the slide diagram):
• count, countByKey, countByValue, flatMap, map (lapply), …
• broadcast, includePackage, …
• Filter, reduce, reduceByKey, distinct, union, …

Data flow
[Diagram. Local: R with its Spark Context, JNI, Java Spark Context. Workers (three): Spark Executor, R.]

Word count
lines <- textFile(sc, "/path/to/file")
# Split each line into words
words <- flatMap(lines,
                 function(line) {
                   strsplit(line, " ")[[1]]
                 })
# Pair each word with an initial count of 1
wordCount <- lapply(words, function(word) { list(word, 1L) })
# Sum the counts per word
counts <- reduceByKey(wordCount, "+", 2L)
output <- collect(counts)
for (wordcount in output) {
  cat(wordcount[[1]], ": ", wordcount[[2]], "\n")
}
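To see what each step of the word count does without a cluster, here is a local sketch of the same pipeline in base R. It does not use SparkR at all; the input lines are hypothetical and stand in for the text file, and `split` plus `vapply` play the role of `reduceByKey`.

```r
# Hypothetical input standing in for the lines of the text file
lines <- c("a b b", "b c")

# flatMap step: split every line into words and flatten the result
words <- unlist(lapply(lines, function(line) strsplit(line, " ")[[1]]))

# map step: pair every word with the count 1
ones <- rep(1L, length(words))

# reduceByKey step: group the counts by word and sum each group
counts <- vapply(split(ones, words), sum, integer(1))

for (w in names(counts)) {
  cat(w, ": ", counts[[w]], "\n")
}
```

On a cluster, the grouping step is where data moves between workers, which is why the Spark version takes a partition count.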

Machine Learning
• Arthur Samuel (1959): Field of study that gives
computers the ability to learn without being
explicitly programmed.

Machine Learning
• Supervised
  Labels, features
  Mapping of features to labels
  Estimate a concept (model) that is closest to the true mapping
• Unsupervised
  No labels
  Clustering of data
Machine Learning
• Supervised
  Naive Bayes, nearest neighbour, decision tree, linear regression, support vector machine…
• Unsupervised
  K-means, DBSCAN, one-class SVM…
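The difference between the two families can be sketched in a few lines of base R. This is only an illustration on simulated data: the numbers, the seed, and the names `fit`, `pts`, and `clusters` are assumptions made for the example.

```r
set.seed(1)  # make the simulated data reproducible

# Supervised: features (x) come with labels (y); the model learns the mapping
x <- 1:50
y <- 3 + 2 * x + rnorm(50)
fit <- lm(y ~ x)
slope <- coef(fit)[["x"]]

# Unsupervised: no labels; the algorithm groups similar points by itself
pts <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 10), ncol = 2))
clusters <- kmeans(pts, centers = 2, nstart = 5)
sizes <- table(clusters$cluster)
```

Linear regression recovers a slope close to the one used to simulate the labels, while k-means only sees the points and has to discover the two groups.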

Supervised
• Classification
  Cat or dog?
• Regression
  Age?

Naive Bayes
• Supervised machine learning
• Classifies texts based on word frequency

Naive Bayes
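$$P(\text{class} \mid \text{doc}) \propto P(\text{class}) \prod_{\text{word} \in \text{doc}} P(\text{word} \mid \text{class})$$

$$\arg\max_{\text{class}} P(\text{class} \mid \text{doc}) = \arg\max_{\text{class}} P(\text{class}) \prod_{\text{word} \in \text{doc}} P(\text{word} \mid \text{class})$$

$$\arg\max_{\text{class}} \log P(\text{class} \mid \text{doc}) = \arg\max_{\text{class}} \left[ \log P(\text{class}) + \sum_{\text{word} \in \text{doc}} \log P(\text{word} \mid \text{class}) \right]$$

$$P(c) = \frac{\text{number of class } c \text{ documents in the training set}}{\text{total number of documents in the training set}}$$

$$P(w \mid c) = \frac{\text{number of occurrences of word } w \text{ in documents of class } c + 1}{\text{total number of words in documents of class } c + \text{size of vocabulary}}$$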
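One practical reason for taking logarithms is numerical: a product of many small probabilities underflows to zero in floating point, while the sum of their logs stays finite. The sketch below uses 400 hypothetical word probabilities of 0.001 each, chosen only to trigger the effect.

```r
# 400 hypothetical word probabilities, each equal to 0.001
p <- rep(1e-3, 400)

# The true product is 1e-1200, far below the smallest positive double,
# so the computed product underflows to 0
raw_product <- prod(p)

# The sum of logs stays finite, so classes can still be compared
log_score <- sum(log(p))
```

Because the logarithm is monotonic, the class with the highest log score is also the class with the highest probability.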

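Example: one training document per class, with word counts:

class   "a"   "b"   "c"
1        1     1     0
2        0     2     1

$$P(1) = P(2) = \frac{1}{2}$$

$$P(a \mid 1) = \frac{1+1}{1+1+0+3} = \frac{2}{5}, \quad P(b \mid 1) = \frac{1+1}{5} = \frac{2}{5}, \quad P(c \mid 1) = \frac{0+1}{5} = \frac{1}{5}$$

$$P(a \mid 2) = \frac{0+1}{0+2+1+3} = \frac{1}{6}, \quad P(b \mid 2) = \frac{2+1}{6} = \frac{3}{6}, \quad P(c \mid 2) = \frac{1+1}{6} = \frac{2}{6}$$

$$P(1 \mid \text{"a b b"}) \propto \frac{1}{2} \times \frac{2}{5} \times \frac{2}{5} \times \frac{2}{5} = 0.032$$

$$P(2 \mid \text{"a b b"}) \propto \frac{1}{2} \times \frac{1}{6} \times \frac{3}{6} \times \frac{3}{6} \approx 0.021$$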
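The example can be checked by applying the smoothing formula directly to the count table. The sketch below is a minimal base R version, not MLlib's implementation; the names `counts`, `prior`, `cond_prob`, and `doc` are chosen for illustration. Note that class 2 contains three words in total, so with a vocabulary of three its denominator is 3 + 3 = 6.

```r
# Word counts per class, taken from the table on the slide
counts <- rbind("1" = c(a = 1, b = 1, c = 0),
                "2" = c(a = 0, b = 2, c = 1))
prior <- c("1" = 0.5, "2" = 0.5)

# Add-one smoothing: (count + 1) / (words in class + vocabulary size)
vocab_size <- ncol(counts)
cond_prob <- (counts + 1) / (rowSums(counts) + vocab_size)

# Score the document as prior times the product of its word probabilities
doc <- c("a", "b", "b")
score <- sapply(rownames(counts), function(cl) {
  prior[[cl]] * prod(cond_prob[cl, doc])
})

# The same comparison in log space
log_score <- sapply(rownames(counts), function(cl) {
  log(prior[[cl]]) + sum(log(cond_prob[cl, doc]))
})

predicted <- names(which.max(log_score))
```

Each row of `cond_prob` sums to 1, which is a quick sanity check on the smoothed estimates.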

MLlib
• Spark’s scalable machine learning library
consisting of common learning algorithms and
utilities, including classification, regression,
clustering, collaborative filtering, dimensionality
reduction, as well as underlying optimization
primitives.

MLlib and SparkR
• Access to MLlib from SparkR is still in development. Until MLlib is officially integrated into SparkR, the following workaround can be used to run MLlib from R.

MLlib’s Naive Bayes in R
[Diagram: R RDD of list(label, features) → Java RDD of serialised R objects → Scala RDD of LabeledPoint, called through rJava]
J("org.apache.spark.mllib.classification.NaiveBayes", "train",
  labeled.point.rdd, lambda)
