Apache Spark Tutorial

Tutorial: Scalable Data Analytics
using Apache Spark
Dr. Ahmet Bulut
@kral
http://www.linkedin.com/in/ahmetbulut

Cluster Computing
• Apache Spark is a cluster computing platform designed
to be fast and general-purpose.
• Cluster computing means running computational tasks
across many worker machines, i.e., a computing cluster.

Unified Computing
• In Spark, you can write one application that uses
machine learning to classify data in real time as it is
ingested from streaming sources.
• Simultaneously, analysts can query the resulting data,
also in real time, via SQL (e.g., to join the data with
unstructured log files).
• More sophisticated data engineers and data scientists
can access the same data via the Python shell for ad
hoc analysis.

Spark Core
• Spark Core: the “computational engine” that is responsible
for scheduling, distributing, and monitoring applications
consisting of many computational tasks on a computing
cluster.

Spark Stack
• Spark Core: the basic functionality of Spark, including
components for task scheduling, memory management,
fault recovery, interacting with storage systems, and
more.
• Spark SQL: Spark’s package for working with
structured data.
• Spark Streaming: Spark component that enables
processing of live streams of data.

Spark Stack
• MLlib: library containing common machine learning
(ML) functionality including classification, regression,
clustering, and collaborative filtering, as well as
supporting functionality such as model evaluation and
data import.
• GraphX: library for manipulating graphs (e.g., a social
network’s friend graph) and performing graph-parallel
computations.
• Cluster Managers: Standalone Scheduler, Apache
Mesos, Hadoop YARN.

“Data Scientist: a person, who is better
at statistics than a computer engineer,  
and better at computer engineering  
than a statistician.”
I do not believe in this new job role. 
Data Science is embracing all stakeholders.

Data Scientists of Spark age
• Data scientists use their skills to analyze data with the
goal of answering a question or discovering insights.
• The data science workflow involves ad hoc analysis.
• Data scientists use interactive shells (vs. building
complex applications) for seeing the results to their
queries and for writing snippets of code quickly.

Data Scientists of Spark age
• Spark’s speed and simple APIs shine for data science, and
its built-in libraries mean that many useful algorithms
are available out of the box.

Storage Layer
• Spark can create distributed datasets from any file
stored in the Hadoop distributed filesystem (HDFS) or
other storage systems supported by the Hadoop APIs
(including your local filesystem, Amazon S3, Cassandra,
Hive, HBase, etc.).
• Spark does not require Hadoop; it simply has support
for storage systems implementing the Hadoop APIs.
• Spark supports text files, SequenceFiles, Avro, Parquet,
and any other Hadoop InputFormat.

Downloading Spark
• The first step to using Spark is to download and unpack
it.
• For a recent precompiled release of Spark, visit
http://spark.apache.org/downloads.html
• Select the package type “Pre-built for Hadoop 2.4 and
later,” and click “Direct Download.”
• This will download a compressed TAR file, or tarball,
called spark-1.2.0-bin-hadoop2.4.tgz.

Directory structure
• README.md 
Contains short instructions for getting started with Spark.
• bin  
Contains executable files that can be used to interact with
Spark in various ways.

Directory structure
• core, streaming, python, ...  
Contains the source code of major components of the Spark
project.
• examples  
Contains some helpful Spark standalone jobs that you can
look at and run to learn about the Spark API.

PySpark
• The first step is to open up one of Spark’s shells. To
open the Python version of the Spark shell, which we
also refer to as the PySpark Shell, go into your Spark
directory and type:  
 
$ bin/pyspark

Logging verbosity
• To control the verbosity of the logging, create a file
in the conf directory called log4j.properties.
• To make the logging less verbose, make a copy of
conf/log4j.properties.template called conf/log4j.properties and
find the following line:
log4j.rootCategory=INFO, console

Then lower the log level to
log4j.rootCategory=WARN, console

IPython
• IPython is an enhanced Python shell that offers features
such as tab completion. Instructions for installing it are at
http://ipython.org.
• You can use IPython with Spark by setting the
IPYTHON environment variable to 1:  
 
IPYTHON=1 ./bin/pyspark

IPython
• To use the IPython Notebook, which is a web-browser-based
version of IPython, use
IPYTHON_OPTS="notebook" ./bin/pyspark
• On Windows, set the variable and run the shell as
follows:
set IPYTHON=1
bin\pyspark

Script #1
• # Create an RDD
>>> lines = sc.textFile("README.md")
• # Count the number of items in the RDD
>>> lines.count()
• # Show the first item in the RDD
>>> lines.first()

Resilient Distributed  
Dataset
• The variable lines is an RDD: Resilient Distributed
Dataset.
• You can run parallel operations on RDDs.

Intro to
Core Spark Concepts
• Every Spark application consists of a driver program
that launches various parallel operations on a cluster.
• The Spark shell is itself a driver program.
• Driver programs access Spark through a SparkContext
object, which represents a connection to a computing
cluster.
• In the Spark shell, the context is automatically created
as the variable sc.

Intro to
Core Spark Concepts
• Driver programs manage a number of nodes called
executors.
• For example, running count() on a cluster would
translate into different nodes counting different
ranges of the input file.

Script #2
•>>> lines = sc.textFile("README.md")
•>>> pythonLines = lines.filter(lambda line: "Python" in line)
•>>> pythonLines.first()

Standalone applications
• Apart from running interactively, Spark can be linked
into standalone applications in either Python, Scala, or
Java.
• The main difference is that you need to initialize your
own SparkContext.
• How to py it:
Write your applications as Python scripts as you
normally do, but to run them with cluster-aware logic,
use the spark-submit script.

Standalone applications
•$ bin/spark-submit my_script.py
• The spark-submit script sets up the environment for
Spark’s Python API to function by including Spark
dependencies.

Initializing Spark in Python
• # Excerpt from your driver program

from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local").setAppName("My App")
sc = SparkContext(conf=conf)

Operations on RDDs
• Transformations and Actions.
• Transformations construct a new RDD from a previous
one.
• “Filtering data that matches a predicate” is an example
transformation.

Transformations
• Let’s create an RDD that holds strings containing the
word Python.
•>>> pythonLines = lines.filter(lambda line: "Python" in line)

Actions
• Actions compute a result based on an RDD.
• They can return the result to the driver, or write it to an
external storage system (e.g., HDFS).
•>>> pythonLines.first()

Transformations & Actions
• You can define RDDs at any time using transformations.
• However, Spark only materializes them once they are used in an
action.
• This is a lazy approach to RDD creation.

Lazy …
• Assume that you want to work with a Big Data file.
• But you are only interested in the lines that contain
Python.
• Were Spark to load and save all the lines in the file as
soon as sc.textFile(…) is called, it would waste storage
space.
• Therefore, Spark waits until it has seen all the
transformations, and then computes only what the action needs.
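The lazy behavior described above can be mimicked in plain Python with generators. This is only an analogy and not how Spark is implemented. The names read_lines and loaded are made up for illustration.

```python
# Plain-Python analogy for lazy evaluation (no Spark required).
loaded = []  # records which lines were actually read


def read_lines(lines):
    """Hypothetical stand-in for sc.textFile(): yields lines one at a time."""
    for line in lines:
        loaded.append(line)
        yield line


data = ["Scala docs", "Python docs", "Java docs", "More Python"]

# "Transformations": nothing is read yet, only a recipe is built.
lines = read_lines(data)
python_lines = (line for line in lines if "Python" in line)
assert loaded == []

# "Action": asking for the first match triggers just enough work.
first_match = next(python_lines)
print(first_match)  # Python docs
print(len(loaded))  # 2
```

Only two of the four lines were read, because the first match was found on the second line.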

Persistence of RDDs
• By default, RDDs are re-computed each time you run an action on
them.
• In order to re-use an RDD in multiple actions, you can
ask Spark to persist it using RDD.persist().
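To make the difference between re-computation and persistence concrete, here is a small plain-Python sketch. LazyDataset is a hypothetical stand-in for an RDD, written only for this illustration. It is not Spark's API or implementation.

```python
# Plain-Python analogy for re-computation vs. persist() (no Spark required).
class LazyDataset:
    """Hypothetical stand-in for an RDD: holds a recipe, not the data."""

    def __init__(self, compute):
        self._compute = compute   # function that rebuilds the data
        self._cache = None
        self._persisted = False
        self.computations = 0     # how many times the recipe has run

    def persist(self):
        self._persisted = True
        return self

    def _materialize(self):
        if self._cache is not None:
            return self._cache
        self.computations += 1
        data = self._compute()
        if self._persisted:
            self._cache = data
        return data

    # Two "actions"
    def count(self):
        return len(self._materialize())

    def first(self):
        return self._materialize()[0]


plain = LazyDataset(lambda: [x * x for x in range(5)])
plain.count()
plain.first()
print(plain.computations)   # 2: recomputed for each action

cached = LazyDataset(lambda: [x * x for x in range(5)]).persist()
cached.count()
cached.first()
print(cached.computations)  # 1: computed once, then reused
```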

Resilience of RDDs
• Once computed, a persisted RDD is kept in memory.
• Persistence to disk is also possible.
• Persistence is optional, and not a default behavior. The
reason is that if you are not going to re-use an RDD,
there is no point in wasting storage space by persisting
it.
• The ability to re-compute is what makes RDDs resilient
to node failures.

Working with Key/Value  
Pairs
• Most often you ETL your data into a key/value format.
• Key/value RDDs let you  
count up reviews for each product, 
group together data with the same key, 
group together two different RDDs.

Pair RDD
• RDDs containing key/value pairs are called pair RDDs.
• Pair RDDs are a useful building block in many programs
as they expose operations that allow you to act on each
key in parallel or regroup data across the network.
• For example, pair RDDs have a reduceByKey() method
that can aggregate data separately for each key.
• The join() method merges two RDDs together by grouping
elements with the same key.

Creating Pair RDDs
• Use a map() function that returns key/value pairs.
•>>> pairs = lines.map(lambda x: (x.split(" ")[0], x))

Transformations on Pair
RDDs
• Let the rdd be [(1,2),(3,4),(3,6)]
• reduceByKey(func) combines values with the same key.
•>>> rdd.reduceByKey(lambda x,y: x+y) -> [(1,2),(3,10)]
• groupByKey() groups values with the same key.
•>>> rdd.groupByKey() -> [(1,[2]),(3,[4,6])]
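The semantics of these two transformations can be emulated on an ordinary Python list. The helper names reduce_by_key and group_by_key are hypothetical. The sketch runs on a single machine and sorts its output for readability, whereas Spark distributes the work and makes no ordering promise.

```python
# Plain-Python emulation of reduceByKey() and groupByKey() semantics.
def reduce_by_key(pairs, func):
    """Combine the values of each key with func, as reduceByKey() does."""
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return sorted(result.items())


def group_by_key(pairs):
    """Collect all values of each key into a list, as groupByKey() does."""
    result = {}
    for key, value in pairs:
        result.setdefault(key, []).append(value)
    return sorted(result.items())


rdd = [(1, 2), (3, 4), (3, 6)]
print(reduce_by_key(rdd, lambda x, y: x + y))  # [(1, 2), (3, 10)]
print(group_by_key(rdd))                       # [(1, [2]), (3, [4, 6])]
```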

RDDs
• mapValues(func) applies a function to each value of a
pair RDD without changing the key.
•>>> rdd.mapValues(lambda x: x+1)
• keys() returns an RDD of just the keys.
•>>> rdd.keys()
• values() returns an RDD of just the values.
•>>> rdd.values()

RDDs
• sortByKey() returns an RDD with the same contents
as the original RDD, sorted by key.
•>>> rdd.sortByKey()

RDDs
• join() performs an inner join between two RDDs.
• Let rdd1 be [(1,2),(3,4),(3,6)] and rdd2 be [(3,9)].
•>>> rdd1.join(rdd2) -> [(3,(4,9)),(3,(6,9))]

Pair RDDs are still RDDs
• You can also filter by value. Try it!

Pair RDDs are still RDDs
• Given that pairs is an RDD with the key being an
integer:
•>>> filteredRDD = pairs.filter(lambda x: x[0] > 5)

Let's do a word count
•>>> rdd = sc.textFile("README.md")
•>>> words = rdd.flatMap(lambda x: x.split(" "))
•>>> result = words.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)
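To see what each step of the word count produces, the same pipeline can be traced in plain Python on two made-up lines of text. This is a single-machine sketch of the flatMap, map, and reduceByKey steps, not Spark code.

```python
from collections import defaultdict

text = ["to be or", "not to be"]  # made-up input lines

# flatMap: split every line and flatten the pieces into one list of words
words = [word for line in text for word in line.split(" ")]

# map: pair each word with a count of 1
pairs = [(word, 1) for word in words]

# reduceByKey: add up the counts of each word
counts = defaultdict(int)
for word, one in pairs:
    counts[word] += one

print(words)         # ['to', 'be', 'or', 'not', 'to', 'be']
print(dict(counts))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```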

Let's identify the top words
•>>> (sc.textFile("README.md")
...     .flatMap(lambda x: x.split(" "))
...     .map(lambda x: (x.lower(), 1))
...     .reduceByKey(lambda x, y: x + y)
...     .map(lambda x: (x[1], x[0]))
...     .sortByKey(ascending=False)
...     .take(5))

Per key aggregation
•>>> aggregateRDD = rdd.mapValues(lambda x: (x, 1)).reduceByKey(
...     lambda x, y: (x[0] + y[0], x[1] + y[1]))
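The (sum, count) pattern above is typically used to compute a per-key average. The following plain-Python sketch traces it on made-up data. The key point is that the combining function must return a single (sum, count) tuple.

```python
# Per-key average via (sum, count) pairs, in plain Python (made-up data).
ratings = [("panda", 0), ("pink", 3), ("pirate", 3), ("panda", 1), ("pink", 4)]

# mapValues(lambda x: (x, 1)): attach a count of 1 to every value
with_counts = [(key, (value, 1)) for key, value in ratings]


def combine(x, y):
    """reduceByKey combiner: returns ONE tuple of (sum, count)."""
    return (x[0] + y[0], x[1] + y[1])


totals = {}
for key, value in with_counts:
    totals[key] = combine(totals[key], value) if key in totals else value

averages = {key: s / float(c) for key, (s, c) in totals.items()}
print(totals)    # {'panda': (1, 2), 'pink': (7, 2), 'pirate': (3, 1)}
print(averages)  # {'panda': 0.5, 'pink': 3.5, 'pirate': 3.0}
```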

Grouping data
• groupByKey() on an RDD with keys of type K and values of
type V returns an RDD of type [K, Iterable[V]].
•>>> rdd.groupByKey()
• We can group data from multiple RDDs using cogroup().
• Given two RDDs sharing the same key type K, with the
respective value types V and W, the resulting RDD is
of type [K, (Iterable[V], Iterable[W])].
•>>> rdd1.cogroup(rdd2)

Joins
• There are two types of joins: inner joins and outer
joins.
• Inner joins require a key to be present in both RDDs.
There is a join() call.
• Outer joins do not require a key to be present in both
RDDs. There is a leftOuterJoin() and a rightOuterJoin().
None is used as the value for the RDD in which the
key is missing.

Joins
•>>> rdd1, rdd2 = [('A',1),('B',2),('C',1)], [('A',3),('C',2),('D',4)]
•>>> rdd1, rdd2 = sc.parallelize(rdd1), sc.parallelize(rdd2)
•>>> rdd1.leftOuterJoin(rdd2).collect() 
[('A', (1, 3)), ('C', (1, 2)), ('B', (2, None))]
•>>> rdd1.rightOuterJoin(rdd2).collect() 
[('A', (1, 3)), ('C', (1, 2)), ('D', (None, 4))]
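The outer-join results above can be reproduced with a short plain-Python sketch. The helper names are hypothetical. The output order follows the input lists, whereas Spark gives no ordering guarantee, which is why 'B' appears last in the Spark output above.

```python
def left_outer_join(left, right):
    """Keep every key of `left`; use None where `right` lacks the key."""
    right_index = {}
    for key, value in right:
        right_index.setdefault(key, []).append(value)
    joined = []
    for key, value in left:
        for other in right_index.get(key, [None]):
            joined.append((key, (value, other)))
    return joined


def right_outer_join(left, right):
    """Mirror image: keep every key of `right`."""
    return [(k, (l, r)) for k, (r, l) in left_outer_join(right, left)]


rdd1 = [("A", 1), ("B", 2), ("C", 1)]
rdd2 = [("A", 3), ("C", 2), ("D", 4)]
print(left_outer_join(rdd1, rdd2))
# [('A', (1, 3)), ('B', (2, None)), ('C', (1, 2))]
print(right_outer_join(rdd1, rdd2))
# [('A', (1, 3)), ('C', (1, 2)), ('D', (None, 4))]
```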

Sorting data
• We can sort an RDD of key/value pairs provided that
there is an ordering defined on the key.
• Once we have sorted our data, subsequent calls, e.g., collect(),
return ordered data.
•>>> rdd.sortByKey(ascending=True, numPartitions=None, keyfunc=lambda x: str(x))
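The keyfunc argument changes the ordering that is applied to the keys. A plain-Python sketch with the built-in sorted() on made-up pairs shows why converting integer keys to strings gives a different order:

```python
pairs = [(10, "x"), (9, "y"), (100, "z")]  # made-up key/value pairs

# Natural ordering of the integer keys
numeric = sorted(pairs, key=lambda kv: kv[0])

# Ordering after converting each key to a string, as keyfunc=str would
as_text = sorted(pairs, key=lambda kv: str(kv[0]))

print(numeric)  # [(9, 'y'), (10, 'x'), (100, 'z')]
print(as_text)  # [(10, 'x'), (100, 'z'), (9, 'y')]
```

As strings, "10" and "100" sort before "9" because comparison proceeds character by character.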

Actions on pair RDDs
•>>> rdd1 = sc.parallelize([('A',1),('B',2),('C',1)])
•>>> rdd1.collectAsMap()
{'A': 1, 'B': 2, 'C': 1}
•>>> rdd1.countByKey()['A']
1

Accumulators
• Accumulators are shared variables.
• They are used to aggregate values from worker nodes
back to the driver program.
• One of the most common uses of accumulators is to
count events that occur during job execution for
debugging purposes.

Accumulators
•>>> inputfile = sc.textFile(inputFile)
• ## Let's create an Accumulator[Int] initialized to 0
•>>> blankLines = sc.accumulator(0)

Accumulators
•>>> def parseOutAndCount(line):
...     # Make the global variable accessible
...     global blankLines
...     if line == "":
...         blankLines += 1
...     return line.split(" ")
•>>> rdd = inputfile.flatMap(parseOutAndCount)
• Do an action so that the workers do real work!
•>>> rdd.saveAsTextFile(outputDir + "/xyz")
•>>> blankLines.value

Accumulators &  
Fault Tolerance
• Spark automatically deals with failed or slow machines
by re-executing failed or slow tasks.
• For example, if the node running a partition of a map()
operation crashes, Spark will rerun it on another node.
• If the node does not crash but is simply much slower
than other nodes, Spark can preemptively launch a
“speculative” copy of the task on another node, and
take its result instead if that finishes earlier.

Accumulators &  
Fault Tolerance
• Even if no nodes fail, Spark may have to rerun a task to
rebuild a cached value that falls out of memory.  
 
 
“The net result is therefore that the same function may
run multiple times on the same data depending on
what happens on the cluster.”

Accumulators &  
Fault Tolerance
• For accumulators used in actions, Spark applies each
task’s update to each accumulator only once.
• For accumulators used in RDD transformations
instead of actions, this guarantee does not exist.
• Bottom line: for reliable counts, update accumulators only in
actions; inside transformations, treat them as debugging aids.
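The missing guarantee for transformations can be illustrated with a plain-Python simulation. The names run_task and parse_and_count are made up, and an ordinary global counter stands in for the accumulator. Re-running the same task applies the side effect a second time:

```python
# Simulating why updates made inside a transformation can be over-counted
# when a task is re-executed (plain Python, no Spark).
blank_lines = 0  # stand-in for an accumulator


def parse_and_count(line):
    """Transformation-style function with a side effect on the counter."""
    global blank_lines
    if line == "":
        blank_lines += 1
    return line.split(" ")


def run_task(partition):
    """Hypothetical task: applies the function to every line of a partition."""
    return [parse_and_count(line) for line in partition]


partition = ["a b", "", "c", ""]

run_task(partition)  # first attempt
print(blank_lines)   # 2

run_task(partition)  # same task re-run, e.g. after a failure
print(blank_lines)   # 4, although the data has only 2 blank lines
```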

Broadcast Variables
• Spark’s second type of shared variable, broadcast
variables, allows the program to efficiently send a large,
read-only value to all the worker nodes for use in one
or more Spark operations.
• Use it if your application needs to send a large,
read-only lookup table or a large feature vector in a
machine learning algorithm to all the nodes.

Yahoo SEM Click Data
• Dataset: Yahoo’s Search Marketing Advertiser
Bid-Impression-Click data, version 1.0
• 77,850,272 rows, 8.1GB in total.
• Data fields:
0 day 
1 anonymized account_id 
2 rank 
3 anonymized keyphrase (list of anonymized keywords) 
4 avg bid 
5 impressions 
6 clicks

Sample data rows
1 08bade48-1081-488f-b459-6c75d75312ae 2 2affa525151b6c51 79021a2e2c836c1a 327e089362aac70c fca90e7f73f3c8ef af26d27737af376a 100.0 2.0 0.0
29 08bade48-1081-488f-b459-6c75d75312ae 3 769ed4a87b5010f4 3d4b990abb0867c8 cd74a8342d25d090 ab9f74ae002e80ff af26d27737af376a 100.0 1.0 0.0
• Primary key: day, account_id, rank, and keyphrase.
• Average bid, impressions, and clicks are
aggregated over the primary key.

Feeling clicky?
keyphrase                          impressions  clicks
iphone 6 plus for cheap            100          2
new samsung tablet                 10           1
iphone 5 refurbished               2            0
learn how to program for iphone    200          0

Getting Clicks = Popularity
• Click Through Rate: $\mathrm{CTR} = \frac{\#\text{ of clicks}}{\#\text{ of impressions}}$
• If CTR > 0, it is a popular keyphrase.
• If CTR == 0, it is an unpopular keyphrase.
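As a quick check of this definition, the sketch below computes the CTR and the resulting popularity label for the four sample keyphrases from the “Feeling clicky?” slide. The function names ctr and popularity are made up for this illustration.

```python
def ctr(clicks, impressions):
    """Click-through rate: number of clicks divided by number of impressions."""
    return clicks / float(impressions)


def popularity(clicks, impressions):
    """1 for a popular keyphrase (CTR > 0), 0 for an unpopular one."""
    return 1 if ctr(clicks, impressions) > 0 else 0


# (keyphrase, impressions, clicks) rows from the sample table
rows = [
    ("iphone 6 plus for cheap", 100, 2),
    ("new samsung tablet", 10, 1),
    ("iphone 5 refurbished", 2, 0),
    ("learn how to program for iphone", 200, 0),
]

labels = [popularity(clicks, impressions) for _, impressions, clicks in rows]
print(ctr(2, 100))  # 0.02
print(labels)       # [1, 1, 0, 0]
```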

Keyphrase = {terms}
• Given keyphrase “iphone 6 plus for cheap”, its terms are:  
 
iphone 
6 
plus 
for 
cheap

Contingency table
Keyphrases        got clicks   no clicks       Total
term t present    s            n-s             n
term t absent     S-s          (N-S)-(n-s)     N-n
Total             S            N-S             N

Clickiness of a term
• For the contingency table of term presence versus click
reception shown previously, we can compute a given term t’s
clickiness value c_t as follows:
• $c_t = \log \frac{(s+0.5)/(S-s+0.5)}{(n-s+0.5)/(N-n-S+s+0.5)}$

Clickiness of a keyphrase
• Given a keyphrase K that consists of terms t_1 t_2 … t_n,
its clickiness can be computed by summing up the
clickiness of the terms present in it.
• That is, $c_K = c_{t_1} + c_{t_2} + \dots + c_{t_n}$
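A minimal sketch of this sum in plain Python follows. The term values in the lookup table are made up. Terms that have no entry contribute nothing, which is an assumption in line with the accuracy code later in the tutorial, where unknown terms are skipped.

```python
def keyphrase_clickiness(keyphrase, term_clickiness):
    """Sum c_t over the terms of the keyphrase; unknown terms contribute 0."""
    return sum(term_clickiness.get(term, 0.0) for term in keyphrase.split())


# Hypothetical per-term clickiness values, for illustration only
cts = {"iphone": -1.5, "plus": 0.5, "cheap": 2.0}

print(keyphrase_clickiness("iphone 6 plus for cheap", cts))  # 1.0
```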

Feeling clicky?
keyphrase                          impressions  clicks  clickiness
iphone 6 plus for cheap            100          2       1
new samsung tablet                 10           1       1
iphone 5 refurbished               2            0       0
learn how to program for iphone    200          0       0

Clickiness of iphone
Keyphrases             got clicks   no clicks   Total
term iphone present    1            2           3
term iphone absent     1            0           1
Total                  2            2           4

Clickiness of iphone
$c_{iphone} = \log \frac{(1+0.5)/(1+0.5)}{(2+0.5)/(0+0.5)}$
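The value on this slide can be checked numerically. The sketch below implements the clickiness formula with the contingency-table symbols s, n, S, and N, assuming the natural logarithm (as math.log in the tutorial's code computes). The function name clickiness is made up. The second call uses the term samsung, which in the four sample keyphrases appears in one keyphrase, and that keyphrase got clicks.

```python
import math


def clickiness(s, n, S, N):
    """c_t from the contingency table.

    s: keyphrases containing the term that got clicks
    n: keyphrases containing the term
    S: keyphrases that got clicks
    N: all keyphrases
    """
    clicked = (s + 0.5) / (S - s + 0.5)
    not_clicked = (n - s + 0.5) / (N - n - S + s + 0.5)
    return math.log(clicked / not_clicked)


# iphone: s=1, n=3, S=2, N=4, i.e. log(1.0 / 5.0)
print(round(clickiness(s=1, n=3, S=2, N=4), 3))  # -1.609

# samsung: s=1, n=1, S=2, N=4, i.e. log(1.0 / 0.2)
print(round(clickiness(s=1, n=1, S=2, N=4), 3))  # 1.609
```

A negative value means the term occurs mostly in keyphrases without clicks, and a positive value means the opposite.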

Mapping
• Given keyphrases and their clickiness (the last value on each line):

k1 = t12 t23 … t99 1
k2 = t19 t201 … t1 0
k3 = t1 t2 … t101 1
…
kn = t1 t2 … t101 1

Mapping Yahoo’s click data
•>>> import math
•>>> rdd = sc.textFile("yahoo_keywords_bids_clicks") \
...     .map(lambda line: (line.split("\t")[3],
...         (float(line.split("\t")[-2]), float(line.split("\t")[-1]))))
•>>> rdd = rdd.reduceByKey(lambda x, y: (x[0]+y[0], x[1]+y[1])) \
...     .mapValues(lambda x: 1 if (x[1]/x[0]) > 0 else 0)

 
 
flatMapping
k1 = t12 t23 … t99 1
k2 = t19 t201 … t1 0
k3 = t1 t2 … t101 1
…
kn = t1 t2 … t101 1
• flatMap k2 to: (t19, 0), (t201, 0), …, (t1, 0)
• flatMap k3 to: (t1, 1), (t2, 1), …, (t101, 1)

flatMapping
•>>> keyphrases0 = rdd.filter(lambda x: x[1]==0)
•>>> keyphrases1 = rdd.filter(lambda x: x[1]==1)
•>>> rdd0 =  
keyphrases0.flatMap(lambda x: [(e,1) for e in x[0].split()])
•>>> rdd1 =  
keyphrases1.flatMap(lambda x: [(e,1) for e in x[0].split()])
•>>> iR = keyphrases0.count()
•>>> R = keyphrases1.count()

Reducing
rdd0: (t1, 19), (t12, 19), (t101, 19), …
rdd1: (t1, 200), (t12, 11), (t101, 1), …

Reducing by Key and
Mapping Values
•>>> t_rdd0 = rdd0.reduceByKey(lambda x, y: x + y) \
...     .mapValues(lambda x: (x+0.5)/(iR-x+0.5))
•>>> t_rdd1 = rdd1.reduceByKey(lambda x, y: x + y) \
...     .mapValues(lambda x: (x+0.5)/(R-x+0.5))

Mapping Values
t_rdd0: (t1, some float value), …
t_rdd1: …

Joining to compute ct
•>>> ct_rdd = t_rdd0.join(t_rdd1).mapValues(lambda x: math.log(x[1]/x[0]))

Broadcasting to all workers
the look-up table ct
•>>> cts = sc.broadcast(dict(ct_rdd.collect()))

Measuring the accuracy of
clickiness prediction
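•>>> def accuracy(rdd, cts, threshold):
...     # Score each keyphrase by summing the clickiness of its known terms
...     csv_rdd = rdd.map(lambda x: (x[0], x[1],
...         sum([cts.value[t] for t in x[0].split() if t in cts.value])))
...     # Count correct (True) and incorrect (False) predictions
...     results = dict(csv_rdd.map(lambda x:
...         (x[1] == (1 if x[2] > threshold else 0), 1))
...         .reduceByKey(lambda x, y: x + y).collect())
...     correct = results.get(True, 0)
...     print(float(correct) / (correct + results.get(False, 0)))
•>>> accuracy(rdd,cts,10)
•>>> accuracy(rdd,cts,-10)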
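The effect of the threshold can be seen on a handful of made-up (label, score) pairs. This plain-Python sketch mirrors the thresholding logic only. It assumes the summed clickiness scores are already computed, and the function name threshold_accuracy is hypothetical.

```python
def threshold_accuracy(labeled_scores, threshold):
    """Fraction of keyphrases whose thresholded score equals the true label."""
    correct = sum(1 for label, score in labeled_scores
                  if label == (1 if score > threshold else 0))
    return correct / float(len(labeled_scores))


# (true label, summed clickiness) pairs; the scores are made-up values
samples = [(1, 2.3), (1, 0.4), (0, -1.2), (0, 0.9)]

print(threshold_accuracy(samples, 0))    # 0.75
print(threshold_accuracy(samples, 10))   # 0.5
print(threshold_accuracy(samples, -10))  # 0.5
```

With an extreme threshold, every keyphrase receives the same prediction, so the accuracy reduces to the share of that label in the data.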

Spark SQL
• Spark’s interface for working with structured and
semi-structured data.
• Structured data is any data that has a schema, i.e., a
known set of fields for each record.

Spark SQL
• Spark SQL can load data from a variety of structured
sources (e.g., JSON, Hive and Parquet).
• Spark SQL lets you query the data using SQL both
inside a Spark program and from external tools that
connect to Spark SQL through standard database
connectors (JDBC/ODBC), such as business intelligence
tools like Tableau.
• You can join RDDs and SQL Tables using Spark SQL.

Spark SQL
• Spark SQL provides a special type of RDD called
SchemaRDD.
• A SchemaRDD is an RDD of Row objects, each
representing a record.
• A SchemaRDD knows the schema of its rows.
• You can run SQL queries on SchemaRDDs.
• You can create SchemaRDD from external data sources,
from the result of queries, or from regular RDDs.

Spark SQL
• Spark SQL can be used via SQLContext or HiveContext.
• SQLContext supports a subset of Spark SQL
functionality excluding Hive support.
• Prefer HiveContext, which offers the full functionality.
• If you have an existing Hive installation, you need to
copy your hive-site.xml to Spark’s configuration
directory.

Spark SQL
• Spark will create its own Hive metastore (metadata DB)
called metastore_db in your program’s work directory.
• The tables you create will be placed underneath  
/user/hive/warehouse on your default file system:
 
- local FS, or 
 
- HDFS if you have hdfs-site.xml on your classpath.

Creating a HiveContext
• >>> ## Assuming that sc is our SparkContext
•>>> from pyspark.sql import HiveContext, Row
•>>> hiveCtx = HiveContext(sc)

Basic Query Example
• ## Assume that we have an input JSON file.
•>>> rdd = hiveCtx.jsonFile("reviews_Books.json")
•>>> rdd.registerTempTable("reviews")
•>>> topterms = hiveCtx.sql("SELECT * FROM reviews LIMIT 10").collect()

SchemaRDD
• Both loading data and executing queries return a
SchemaRDD.
• A SchemaRDD is an RDD composed of Row objects
with additional schema information of the types in each
column.
• Row objects are wrappers around arrays of basic types
(e.g., integers and strings).
• Since Spark 1.3, SchemaRDD has been renamed
to DataFrame.

SchemaRDD
• A SchemaRDD is also an RDD, and you can run regular
RDD transformations (e.g., map() and filter()) on it
as well.
• You can register any SchemaRDD as a temporary table
to query it via hiveCtx.sql.

Working with Row objects
• In Python, you access the i-th element of a row using row[i], or
a column by its name as row.column_name.
•>>> topterms.map(lambda row: row.Keyword)

Caching
• If you expect to run multiple tasks or queries against the
same data, you can cache it.
•>>> hiveCtx.cacheTable("mysearchterms")
• When caching a table, Spark SQL represents the data in
an in-memory columnar format.
• The cached table will be destroyed once the driver
exits.

Printing schema
•>>> rdd = hiveCtx.jsonFile("reviews_Books.json")
•>>> rdd.printSchema()

Converting an RDD to a
SchemaRDD
• First create an RDD of Row objects and then call
inferSchema() on it.
•>>> rdd = sc.parallelize([Row(name="hero",
...     favouritecoffee="industrial blend")])
•>>> srdd = hiveCtx.inferSchema(rdd)
•>>> srdd.registerTempTable("myschemardd")

Working with nested data
•>>> import json
•>>> sqlContext = hiveCtx  # reuse the HiveContext created earlier
•>>> a = [{'name': 'mickey'}, {'name': 'pluto', 'knows':
...     {'friends': ['mickey', 'donald']}}]
•>>> rdd = sc.parallelize(a)
•>>> rdd.map(lambda x: json.dumps(x)).saveAsTextFile("test")
•>>> srdd = sqlContext.jsonFile("test")

•>>> srdd.registerTempTable("test")
•>>> sqlContext.sql("SELECT knows.friends FROM test").collect()

MLlib
• Spark’s library of machine learning functions.
• The design philosophy is simple: 
- Invoke ML algorithms on RDDs.

Text Classification
• Step 1. Start with an RDD of strings representing your
messages.
• Step 2. Run one of MLlib’s feature extraction algorithms
to convert text into numerical features (suitable for
learning algorithms). The result is an RDD of vectors.
• Step 3. Call a classification algorithm (e.g., logistic
regression) on the RDD of vectors. The result is a
model.

Text Classification
• Step 4. You can evaluate the model on a test set.
• Step 5. You can use the model to make point predictions:
given a new data sample, you can classify it using the model.
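Step 2 is the least obvious of these steps, so here is a plain-Python sketch of one feature extraction technique, the hashing trick that a hashed term-frequency vector relies on. The function hashing_tf is hypothetical, and zlib.crc32 is used only to get a deterministic hash. MLlib's own hash function differs, so the bucket indices will not match Spark's.

```python
import zlib


def hashing_tf(words, num_features=16):
    """Map a list of words to a fixed-length term-frequency vector."""
    vector = [0] * num_features
    for word in words:
        # Hash the word to a bucket index; colliding words share a bucket.
        index = zlib.crc32(word.encode("utf-8")) % num_features
        vector[index] += 1
    return vector


features = hashing_tf("get cheap stuff cheap".split(" "))
print(len(features))  # 16
print(sum(features))  # 4
```

The vector length is fixed up front, so no vocabulary needs to be built or shipped to the workers. The price is that distinct words may collide in the same bucket.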

System requirements
• MLlib requires the gfortran runtime library for your OS.
• MLlib needs NumPy.

Spam Classification
•>>> from pyspark.mllib.regression import LabeledPoint
•>>> from pyspark.mllib.feature import HashingTF
•>>> from pyspark.mllib.classification import LogisticRegressionWithSGD
•>>> spamRows = sc.textFile("spam.txt")
•>>> hamRows = sc.textFile("ham.txt")

• ### for mapping emails to vectors of 10000 features.
•>>> tf = HashingTF(numFeatures=10000)

• ## Feature Extraction, email -> word features
•>>> spamFeatures = spamRows.map(lambda email:
...     tf.transform(email.split(" ")))
•>>> hamFeatures = hamRows.map(lambda email:
...     tf.transform(email.split(" ")))

• ### Label feature vectors
•>>> spamExamples = spamFeatures.map(lambda features:
...     LabeledPoint(1, features))
•>>> hamExamples = hamFeatures.map(lambda features:
...     LabeledPoint(0, features))

•>>> trainingData = spamExamples.union(hamExamples)
• ### Since learning via Logistic Regression is iterative
•>>> trainingData.cache()

•>>> model = LogisticRegressionWithSGD.train(trainingData)

• ### Let's test it!
•>>> posTest = tf.transform("O M G GET cheap stuff".split(" "))
•>>> negTest = tf.transform("Enjoy Spark on Machine Learning".split(" "))
•>>> print(model.predict(posTest))
•>>> print(model.predict(negTest))

Data Types
• MLlib contains a few specific data types located in
pyspark.mllib.
• Vector: a mathematical vector (sparse or dense).
• LabeledPoint: a pair of a feature vector and its label.
• Rating: a rating of a product by a user.
• Various Model classes: the resulting model from
training. Each has a predict() function for ad hoc querying.
