
“Analytics using Apache Spark”

(Lightning-Fast Cluster Computing)


Outline

Introduction to Apache Spark & Ecosystem


How to use Spark?
Resilient Distributed Datasets (RDDs)
Spark Resource Managers
Memory and Persistence
Spark Streaming
Machine Learning Library (MLlib)
Use Cases/Case Studies
How familiar are you with Spark?
Introduction to Spark
What is Apache Spark?

“Apache Spark™ is a fast and general engine for large-scale data processing”

The engine handles scheduling, monitoring, and distribution of work across the cluster.

Latest version – v2.4.3, Released on May 08, 2019


Spark Motivation

Disk I/O is very slow: a hard disk reads at roughly 100 MB/s, while RAM reads at roughly 10,000 MB/s.
The cost of memory has come down.
Opportunity: we can have more memory, so keep more data "in memory".

Ref: http://www.storagereview.com/images/ramdiskarticle1.jpg
Spark sets a new record in Petabyte Sort

Ref: https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
Other Benchmarks

Ref: http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7847709
Evolution

General batch processing: MapReduce (2004 – 2013)
Specialized systems (2007 – 2015?): Pregel, Giraph, Tez, Drill, Storm, S4, Mahout, Impala, GraphLab (iterative, interactive, ML, streaming, graph, SQL, etc.)
General unified engine: Spark (2014 – ?)
Evolution…

2002: MapReduce at Google
2004: MapReduce paper
2006: Hadoop at Yahoo
2008: Hadoop Summit
2010: Spark paper
2014: Apache Spark becomes a top-level Apache project
Apache Spark…
 Developed in 2009 at UC Berkeley AMPLab
 Open sourced in 2010
 Spark has one of the largest open source communities in Big Data
http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf

spark.apache.org
Spark Components
 Spark Core
o Provides distributed task dispatching,
scheduling, and basic I/O functionalities.
 Spark SQL
o Lets you query structured data as a distributed
dataset (RDD) in Spark
 Spark Streaming
o Allows scalable stream processing on micro-
batches of data
 MLlib
o Scalable machine learning library
 GraphX
o Allows custom iterative graph algorithms using Pregel API
Spark Core

 Spark Core
o Provides distributed task dispatching,
scheduling, and basic I/O functionalities.

o Handles scheduling, monitoring, and distribution of work


SparkSQL

 Spark SQL
o Lets you query structured data inside Spark
programs, using either SQL or a familiar
DataFrame API. Usable in Java, Scala, Python
and R.
o Runs SQL / HiveQL queries

 Features:
o Integrated: Mix Spark programs with RDDs (Python, Scala and Java) and Spark SQL queries
o Unified Data Access: Query from different data sources such as Hive tables, JSON.
o Hive Compatibility: Runs unmodified Hive queries on existing warehouses.
o DB Connectivity: JDBC and ODBC
o Performance and Scalability: scales to thousands of nodes
GraphX

 GraphX
o Spark API for graph-parallel computation.
o Introduces the Resilient Distributed Property
Graph: a directed multigraph with properties
attached to each vertex and edge.
o GraphX includes a growing collection of graph
algorithms and builders to simplify graph analytics
tasks.

Example: Find the oldest follower of each user in a network

val oldestFollowerAge = graph.mrTriplets(
  e => (e.dst.id, e.src.age),   // Map
  (a, b) => max(a, b)           // Reduce
).vertices
Spark’s Languages

Scala

Python

Java

R
Spark Packages – Total 96 (as of 27-07-2015)
Spark Packages – Total 190 (as of 02-01-2016)
Spark Packages – Total 378 (as of 26-10-2017)
Spark Packages – Total 451 (as of 17-06-2019)
Hadoop or Spark?
Hadoop MapReduce

2004 – First MapReduce paper: "MapReduce: Simplified Data Processing on Large Clusters"
(Jeffrey Dean and Sanjay Ghemawat, OSDI 2004, Google, Inc.)

MapReduce is a high-level programming model and implementation for "large-scale parallel data processing".

Image Source: http://tm.durusau.net/?cat=84
Spark .vs. Hadoop
 Spark is a better candidate for iterative jobs.
 In Hadoop MapReduce, each stage passes through the hard drives; iterative jobs involve
a lot of disk I/O for each repetition, which is very slow.
 Apache Spark uses memory instead of disk, making it 10 to 100x faster.
Spark .vs. Hadoop MR – Programming Complexity

Word Count using MR:
The Hadoop MapReduce version (package org.apache.hadoop.examples, class WordCount) needs a
TokenizerMapper, an IntSumReducer, and a driver main() that configures and submits the Job –
dozens of lines of Java boilerplate.

Word Count using Spark:

from pyspark import SparkContext

logFile = "hdfs://localhost:9000/user/bigdatavm/input"

sc = SparkContext("spark://bigdata-vm:7077", "WordCount")

textFile = sc.textFile(logFile)

wordCounts = textFile.flatMap(lambda line: line.split()) \
                     .map(lambda word: (word, 1)) \
                     .reduceByKey(lambda a, b: a + b)
Spark .vs. Hadoop MR…
MapReduce Limitations:

 Inefficient if you are repeating similar searches again and again.
 Reduce operations do not take place until all the Maps are complete.
 Designing and programming using MR is difficult.
 Performance bottlenecks.
 Batch-oriented: cannot accommodate real-time use cases.
"MR doesn't compose well for large applications"

Therefore, specialized systems were built as workarounds…
Spark .vs. Hadoop MR…
Spark's Advantages:

 Generalized patterns: a unified engine for many use cases
 Lazy evaluation of the lineage graph: reduces wait states, better pipelining
 In-memory processing: uses large memory spaces
 Functional programming / ease of use: reduction in cost to maintain large apps
 Lower overheads: starts a job quickly
 Less expensive shuffles
Is Spark replacing Hadoop ?
Enemies OR Frenemies ??
Can work together even though there are fundamental differences.

Image Credits: www.cartoonbucket.com


Migrating from Hadoop to Spark - Tools

Application Type | Hadoop or Other | Spark
Distributed Storage | HDFS | No DFS (it has to use a third-party file system)
Batch Processing | Hadoop MR | Spark RDDs & DAGs
Querying/ETL | Hive, Pig, HBase | Spark SQL (supports NoSQL data stores), Spork (Pig on Spark)
Stream Processing / Real-time Processing | Hadoop Streaming, Storm | Spark Streaming
Machine Learning | Mahout (almost migrated to Spark) | MLlib or Mahout
Interactive Exploration | -NA- | Interactive shells for Scala and Python
Hadoop .vs. Spark
 Infrastructure Type:
 Hadoop is essentially a distributed data infrastructure: it distributes massive data
collections across multiple nodes within a cluster of commodity servers,
 whereas Spark is a data-processing framework.
 Performance:
 Spark performs better when all the data fits in memory, especially on
dedicated clusters;
 Hadoop MapReduce is designed for data that doesn't fit in memory, and it can
run well alongside other services.
 Compatibility with Data Sources/Types:
 Spark's compatibility with data types and data sources is the same as Hadoop
MapReduce.
Hadoop .vs. Spark
 Batch processing with huge data:
 Hadoop MapReduce is great for batch processing if the data size is huge,
 whereas Spark needs more resources to compete with Hadoop.
 Fault Tolerance:
 Both have good fault tolerance, but
 Hadoop MapReduce is slightly more tolerant.
 However, Hadoop's fault tolerance comes at the cost of computational performance.
 In Spark, we have to re-compute using lineage information.
 Security:
 Spark security is still in its infancy;
 Hadoop MapReduce has more security features and projects.
Do I need to learn Hadoop first to learn Apache Spark?

No, you don't need to learn Hadoop to learn Spark.


Spark is an independent framework.

If you want to use HDFS or YARN with Spark, better start with Hadoop.
Installing Apache Spark
Installing Spark

1. Using pre-built binaries (along with Hadoop)
2. Using source
3. Using PyPI
PySpark is now available on PyPI. To install, just run:
$ pip install pyspark
How to use Spark?
Starting a Cluster
Master: You can start a standalone master server by executing:
$ ./sbin/start-master.sh
• Once started, the master will print out a spark://HOST:PORT URL for itself
• Master’s web UI: http://localhost:8080 (by default)
Slave(s): Start one or more workers and connect them to the master via:
• Starts a slave on the machine this script is executed on.
$ ./sbin/start-slave.sh <master-spark-URL>
• Starts a slave instance on each machine specified in the conf/slaves file.
$ ./sbin/start-slaves.sh <master-spark-URL>
Slave’s web UI: http://localhost:8081 (by default)
Configuration
Options

The following configuration options can be passed to the master and worker, for example
-h/--host, -p/--port, --webui-port and, for workers only, -c/--cores, -m/--memory and
-d/--work-dir (see the Spark standalone documentation for the full list).
Cluster Launch Scripts
 To launch a Spark standalone cluster with the launch scripts, you should create a file called
conf/slaves in your Spark directory, which must contain the hostnames of all the machines
where you intend to start Spark workers, one per line.
 If conf/slaves does not exist, the launch scripts default to a single machine (localhost), which
is useful for testing.
 Note, the master machine accesses each of the worker machines via ssh.
 By default, ssh is run in parallel and requires password-less (using a private key) access to be
setup.
 If you do not have a password-less setup, you can set the environment variable
SPARK_SSH_FOREGROUND and serially provide a password for each worker.
Cluster Launch Scripts
 sbin/start-master.sh - Starts a master instance on the machine the script is executed on.
 sbin/start-slaves.sh - Starts a slave instance on each machine specified in the conf/slaves file.
 sbin/start-slave.sh - Starts a slave instance on the machine the script is executed on.
 sbin/start-all.sh - Starts both a master and a number of slaves as described above.
 sbin/stop-master.sh - Stops the master that was started via the sbin/start-master.sh script.
 sbin/stop-slaves.sh - Stops all slave instances on the machines specified in the conf/slaves file.
 sbin/stop-all.sh - Stops both the master and the slaves as described above.

Note that these scripts must be executed on the machine you want to run the Spark master on, not your local machine.
Interactive Shells
Python:
$ pyspark --master local[4]
$ pyspark --master local[*] (uses as many threads as the number of cores)
$ pyspark --master local[4] code.py
$ pyspark --master spark://IPAddressOfMaster:7077 code.py
Scala:
$ spark-shell (by default uses only one thread, for both the driver and the program)
$ spark-shell --master local[4] --packages "org.example:example:0.1"
$ spark-shell --master local[*] --jars code.jar
Spark Master
The master parameter for Spark determines which type and size of cluster to use.

Master URL | Meaning
local | Run Spark locally on one worker thread
local[K, F] | Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine) and F max failures
local[*] | Run Spark locally with as many worker threads as logical cores on your machine
spark://HOST:PORT | Connect to the given Spark standalone cluster master. The port must be whichever one your master is configured to use, which is 7077 by default
spark://HOST1:PORT1,HOST2:PORT2,… | Connect to the given Spark standalone cluster with standby masters with ZooKeeper. The list must have all the master hosts in the high-availability cluster set up with ZooKeeper. The port must be whichever each master is configured to use, which is 7077 by default
yarn-cluster | Connect to a YARN cluster. The cluster location will be found based on the HADOOP_CONF_DIR or YARN_CONF_DIR variable
Launching Applications

# Run application locally on 8 cores
./bin/spark-submit \
--master local[8] \
Python_code.py \
100 # Arguments to the Python program
Launching Applications…
# Run on a Spark standalone cluster in client deploy mode

./bin/spark-submit \
--master spark://207.184.161.138:7077 \
--executor-memory 20G \
--total-executor-cores 100 \
/path/to/examples.jar \
1000

--executor-memory  Amount of memory to use per executor process

--total-executor-cores  Total number of cores to use across all executors (Spark standalone and Mesos only)


Launching Applications…
Other Examples

# Run application locally on 8 cores


$ spark-submit --class org.apache.spark.examples.SparkPi \
--master local[8] \
/path/to/examples.jar 100

# Run a Python application on a Spark Standalone cluster


$ spark-submit --master spark://207.184.161.138:7077 \
examples/src/main/python/pi.py 100
Installing Jupyter
1. Install Anaconda – Jupyter is part of it.
2. Otherwise, install using PyPI:
$ pip install jupyter
Connecting Spark with Jupyter
Method 1 - Configure PySpark driver
a) Add the below lines to your ~/.bashrc file.
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
b) Source your bashrc file
$ source ~/.bashrc
c) Restart your terminal and launch PySpark again:
$ pyspark --master local[4] test.py
Connecting Spark with Jupyter
Method 2 – Using findspark package
 It is a more generalized way to use PySpark in a Jupyter Notebook: use the
"findspark" package to make a SparkContext available in your code.
 The findspark package is not specific to Jupyter Notebook; you can use this trick
in your favorite IDE too.
a) To install findspark:
$ pip install findspark
b) Launch a regular Jupyter Notebook:
$ jupyter-notebook
c) In your Notebook add the following lines:
import findspark
findspark.init()
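A minimal sketch of what the notebook cell could look like after findspark.init(); the local master and the app name "test" are assumptions for illustration:

import findspark
findspark.init()

import pyspark
sc = pyspark.SparkContext(master="local[4]", appName="test")  # create a SparkContext in the notebook

print(sc.parallelize(range(10)).sum())  # quick sanity check: should print 45
sc.stop()                               # release the context when done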
Spark Execution Model

Master node, Spark driver & workers (with YARN) – see the architecture diagrams at the reference below.

Credits:
http://0x0fff.com/spark-architecture/
Spark Program Execution
A Spark program has two components:
 Driver component
 Worker component

The driver component runs on:
 the master node (or) local threads

The worker program runs on:
 worker nodes (or) local threads
Spark Program Execution

1. Spark Context: initialize the driver and submit the app to the cluster manager.
2. Build a DAG: the driver builds a DAG of data and operations.
3. DAG Scheduling: divide the DAG into stages and then into tasks.
4. Task Scheduling: schedule the tasks to executors (tells where to run).
5. Finish the Job: the results are sent to the driver; the job finishes.
Initializing Spark - Spark Context
 A Spark program first creates a SparkContext object
 Tells Spark how and where to access a cluster
 The pySpark shell automatically creates the "sc" variable

Scala:
val conf = new SparkConf().setAppName(appName).setMaster(master)
new SparkContext(conf)

Java:
SparkConf conf = new SparkConf().setAppName(app).setMaster(master);
JavaSparkContext sc = new JavaSparkContext(conf);

Python:
conf = SparkConf().setAppName(appName).setMaster(master)
sc = SparkContext(conf=conf)
Spark jargons
Application User program built on Spark. Consists of a driver program and executors on the cluster.
Application jar A jar containing the user's Spark application. In some cases users will want to create an
"uber jar" containing their application along with its dependencies. The user's jar should
never include Hadoop or Spark libraries, however, these will be added at runtime.
Driver program The process running the main part of the application and creating the SparkContext
Cluster manager Acquiring & scheduling resources on the cluster (e.g. standalone manager, Mesos, YARN)
Worker node Any node that can run application code in the cluster, Which have the executors
Executor A process launched for an application on a worker node, that runs tasks and keeps data in
memory or disk storage across them. Each application has its own executors.
Job A parallel computation consisting of multiple tasks that gets spawned in response to a Spark
action (e.g. save, collect); you'll see this term used in the driver's logs.
Stage Each job gets divided into smaller sets of tasks called stages that depend on each other
(similar to the map and reduce stages in MapReduce); You can see this in driver log.
Task A unit of work that will be sent to one executor

DAG DAG stands for Directed Acyclic Graph, in the present context its a DAG of operators.
Resilient Distributed Datasets (RDDs)
What is RDD?
Definition:
Resilient Distributed Dataset (RDD) is
the primary data abstraction in Apache
Spark and the core of Spark.
 Spark Core Concept  Resilient
Distributed Data
 The primary abstraction in Spark
 It is a Partitioned fault-tolerant
collection of data elements.
 Can be operated on in parallel.

Image Credits: http://horicky.blogspot.in/2013/12/spark-low-latency-massively-parallel.html


What is RDD?
Properties:
 Resilient - i.e. fault-tolerant with the help of RDD lineage graph and so able to
recompute missing or damaged partitions due to node failures.
 Distributed - with data residing on multiple nodes in a cluster, Enable operations on
collection of elements in parallel
 Dataset - is a collection of partitioned data with primitive values or values of values,
e.g. tuples or other objects

Note: RDD are Immutable - Cannot be changed once it is created


How to construct RDDs?
RDDs can be constructed:
1. By parallelizing existing Python collections (parallelized collections)
2. By transforming an existing RDD (PipelinedRDD)
3. From external data (a dataset in an external storage system, such as a shared
filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat)
RDDs from Parallelized Collections
 Parallelized collections are created by calling SparkContext’s “parallelize” method on
an existing iterable or collection in your driver program.
 The elements in the collection will be converted as a distributed dataset that can be
operated on in parallel.

data = [1, 2, 3, 4, 5, 6, 7]
RDD1 = sc.parallelize(data, 4)

 One important parameter for parallelized collections is the number of partitions to cut the
dataset into.
 Spark will run one task for each partition of the dataset.
 More partitions  more parallelism (a quick check is sketched below).
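A small sketch (reusing the sc and data above) of how to inspect the partitioning; getNumPartitions() and glom() are standard RDD methods, and the exact split is an implementation detail:

data = [1, 2, 3, 4, 5, 6, 7]
RDD1 = sc.parallelize(data, 4)

print(RDD1.getNumPartitions())   # 4
print(RDD1.glom().collect())     # contents per partition, e.g. [[1], [2, 3], [4, 5], [6, 7]]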
How is data partitioned?
data = [1, 2, 3, 4, 5, 6, 7]
RDD1 = sc.parallelize(data, 4)

The dataset is cut into 4 partitions, for example:
P1: 1, 2
P2: 3, 4
P3: 5, 6
P4: 7

(In the original figure, the cores allocated to the application are outlined in purple;
7 cores are allocated across three workers: 2 + 3 + 2.)
How is the data set partitioned?

data = [1, 2, 3, 4, 5, 6, 7]
RDD1 = sc.parallelize(data, 4)

4 partitions  4 tasks  4 cores used out of the 7 allocated
How is the data set transformed and collected?

# Definition of the sub function
def sub(value):
    return (value - 1)

# Use map for subtraction
subRDD1 = RDD1.map(sub)

# Apply the collect action on subRDD1 to collect the numbers
print(subRDD1.collect())

Each partition is transformed in place: [1, 2]  [0, 1], [3, 4]  [2, 3], [5, 6]  [4, 5], [7]  [6];
collect() gathers the results [0, 1, 2, 3, 4, 5, 6] back at the driver.
RDDs from External Datasets
 PySpark can create distributed datasets from any storage source supported by Hadoop:
 local file system,
 HDFS,
 Cassandra,
 HBase,
 Amazon S3, etc.
 Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.
 Text file RDDs can be created using SparkContext’s textFile method.

distFile = sc.textFile("/home/bda/data.txt", minPartitions=4, use_unicode=True)
External Datasets…
NOTE:
 If using a path on the local file system, the file must also be accessible at the same path
on worker nodes. Either copy the file to all workers or use a network-mounted shared
file system.
 All of Spark’s file-based input methods, including textFile, support running on
directories, compressed files, and wildcards as well.
 Number of partitions of a file:
 Mention the number of partitions as the 2nd argument of the textFile method.
 By default, Spark creates one partition for each block of the file (blocks being
128 MB by default in HDFS),
 but you can also ask for a higher number of partitions by passing a larger value.
 Note that you cannot have fewer partitions than blocks (a short example follows).
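A short sketch of requesting more partitions when reading a file (the path is the same hypothetical one used above):

distFile = sc.textFile("/home/bda/data.txt", 8)   # ask for at least 8 partitions
print(distFile.getNumPartitions())                # at least the number of HDFS blocks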
Reading and Saving RDDs
Spark Python API supports additional formats:
Reading Directories
 SparkContext.wholeTextFiles lets you read a directory containing multiple small
text files, and returns each of them as (filename, content) pairs. This is in contrast
with textFile, which would return one record per line in each file.
Saving RDDs
 RDD.saveAsTextFile("PATH") – a round-trip sketch is given below
 RDD.saveAsPickleFile and SparkContext.pickleFile support saving an RDD in a
simple format consisting of pickled Python objects. Batching is used on pickle
serialization, with default batch size 10.
Others
 SequenceFile and Hadoop Input/Output Formats
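A minimal round-trip sketch for these saving/reading APIs; the /tmp output paths are hypothetical and must not already exist:

rdd = sc.parallelize([("f1", "hello"), ("f2", "world")])

rdd.saveAsTextFile("/tmp/demo_text")          # one part-* file per partition
rdd.saveAsPickleFile("/tmp/demo_pickle", 10)  # pickled Python objects, batch size 10

back = sc.pickleFile("/tmp/demo_pickle")
print(back.collect())

pairs = sc.wholeTextFiles("/tmp/demo_text")   # (filename, content) pairs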
PySpark's Writable support for RDDs
Spark Python API supports additional formats:
Writable Support:
 PySpark SequenceFile support loads an RDD of key-value pairs within Java,
 converts Writables to base Java types, and pickles the resulting Java objects
using Pyrolite.
 When saving an RDD of key-value pairs to a SequenceFile, PySpark does the reverse:
 it unpickles Python objects into Java objects and then converts them to Writables.
 The following Writables are automatically converted:
Examples: Writable type "Text" is converted into the Python type "unicode str";
Writable type "IntWritable" is converted into the Python type "int".
RDD Life Cycle
1. Create an input RDD from parallelizing a collection OR from an external data source
2. Perform the required transformations on the base RDD to create new RDDs
3. Instruct Spark to Store/Persist them in Disk/Cache. Persist using cache if it can be
reused later
4. Perform one or more actions on them by parallel computations using spark

Image Credits: Spark devoxx2014, Andre Petrella


Transformations
Transformations
 Creates new datasets (RDDs) from the existing ones
 In other words, transformations are functions that take a RDD as the input and produce
one or many RDDs as the output.
 Input RDD will not be changed (since RDDs are immutable)
 Lazy Evaluation:
 They do not compute their results right away. Instead, they just remember the
transformations applied (Lineage) to some base dataset (e.g. a file).
 Spark optimizes the required calculations
 Computes when there is an “action”
 Transformations can be treated as recipes for creating the result of an action (see the sketch below).
 Lineage helps to recover from failures and slow workers.
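A small sketch of lazy evaluation in PySpark (the file path is hypothetical): the transformations return immediately and only record lineage; the computation happens when the action runs.

lines = sc.textFile("/home/bda/data.txt")       # nothing is read yet
errors = lines.filter(lambda l: "Error" in l)   # still nothing computed, lineage recorded
print(errors.toDebugString())                   # inspect the lineage
print(errors.count())                           # action: triggers the actual computation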
Why Transformations?
1. To understand the data ( List out the number of columns in data and their type)
2. Preprocess the data (Remove null values, Remove outliers).
3. Filter the data
4. Fill the null values or missing values in data ( Filling the null values in data by constant,
mean, median, etc)
5. Deriving new features from data.
And so on….
Transformations

 Spark stores the lineage back to the base RDD.
 By applying transformations you incrementally build an RDD lineage with all the
parent RDDs of the final RDD(s).
Transformations…

Manipulations (on a single RDD): map, flatMap, filter, distinct
Across RDDs (on multiple RDDs): union, subtract, intersection, join, cartesian
Reorganization (on a single key-value RDD): groupByKey, reduceByKey, sortByKey
Tuning (on a single RDD): coalesce, repartition
Examples – Transformations - map
# Create an RDD
data = [1, 2, 3, 4, 5, 6, 7]
RDD1 = sc.parallelize(data, 4)

# Create the sub function to subtract 1
# Args: one integer value; Returns: one integer value
def sub(value):
    return (value - 1)

# Transform RDD1 through the map transformation using the sub function
subRDD = RDD1.map(sub)

# Print the lineage
print(subRDD.toDebugString())
More Examples – Transformations – filter & distinct

# Create an RDD
data = [1, 2, 2, 3, 4, 4, 5, 6, 7]
RDD1 = sc.parallelize(data, 4)

# Create an RDD of only even numbers (Filter odd numbers)


RDD2 = RDD1.filter( lambda x: x%2 == 0 )

# Get the distinct even numbers


RDD3 = RDD2.distinct()
More Examples – Transformations – map & flatMap

# Create an RDD
data = [1, 2, 3]
RDD1 = sc.parallelize(data, 3)

# Create an RDD of [x, x+5] pairs using map
RDD2 = RDD1.map( lambda x: [ x, x+5 ] )
Output: [1, 2, 3]  [ [1,6], [2,7], [3,8] ]

# Create a flattened RDD of x and x+5 using flatMap
RDD3 = RDD1.flatMap( lambda x: [ x, x+5 ] )
Output: [1, 2, 3]  [ 1, 6, 2, 7, 3, 8 ]
Pair RDDs
 A pair RDD is an RDD where each element is a pair tuple (key, value).
 For example, sc.parallelize([('a', 1), ('a', 2), ('b', 1)]) would create a pair RDD where the
keys are 'a', 'a', 'b' and the values are 1, 2, 1.
 Transformations:
 groupByKey()
 reduceByKey()

pairRDD = sc.parallelize([('a', 1), ('a', 2), ('b', 1)])

# mapValues is only used to improve the format for printing
print(pairRDD.groupByKey().mapValues(lambda x: list(x)).collect())
Output: [('a', [1, 2]), ('b', [1])]
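Since reduceByKey is also listed above, a matching sketch on the same pairRDD:

# Sum the values for each key; reduceByKey combines values per key on each partition first
print(pairRDD.reduceByKey(lambda a, b: a + b).collect())
Output (order may vary): [('a', 3), ('b', 1)]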
Actions
Actions
 Return a value to the driver program after running a computation on the dataset
 Mechanism to get the results from the workers to the driver
 Cause Spark to execute the recipe to transform the source

Data fetching: collect, take(n), first, takeSample
Aggregation: reduce, count, countByKey
Output: foreach, foreachPartition, saveAsTextFile, saveAsSequenceFile, saveAsObjectFile, saveToCassandra
Actions – Examples – count, collect

# Create an RDD
data = [1, 2, 2, 3, 4, 4, 5, 6, 7]
RDD1 = sc.parallelize(data, 4)

# Count the number of elements in RDD1 and print.


print(RDD1.count( ))

# Collect the data of RDD1 and print


print(RDD1.collect())
Additional Actions – Examples
# Get the first element
print(filteredRDD.first())

# The first 4
print(filteredRDD.take(4))

# Create a new base RDD to show countByValue
repetitiveRDD = sc.parallelize([1, 2, 3, 1, 2, 3, 1, 2, 1, 2, 3, 3, 3, 4, 5, 4, 6])
print(repetitiveRDD.countByValue())
Output: defaultdict(<type 'int'>, {1: 4, 2: 4, 3: 5, 4: 2, 5: 1, 6: 1})
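Two more of the actions listed earlier, sketched on the same repetitiveRDD:

# Aggregate all elements with reduce
print(repetitiveRDD.reduce(lambda a, b: a + b))   # 46, the sum of all elements

# Take a random sample of 3 elements without replacement
print(repetitiveRDD.takeSample(False, 3))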
Types of RDDs

• FilteredRDD • MappedRDD • PairRDD • ShuffledRDD • UnionRDD • PythonRDD • HadoopRDD
• DoubleRDD • JdbcRDD • JsonRDD • SchemaRDD • VertexRDD • EdgeRDD
• CassandraRDD (DataStax) • GeoRDD (ESRI)
Shared Variables
 When a function passed to a Spark operation (such as map or reduce) is executed on a
remote cluster node, it works on separate copies of all the variables used in the function.
 These variables are copied to each machine, and no updates to the variables on the
remote machine are propagated back to the driver program.
 Supporting general, read-write shared variables across tasks would be inefficient.
 However, Spark does provide two limited types of shared variables for two common
usage patterns:
 broadcast variables
 accumulators.
Broadcast Variables
 Keep a read-only variable cached on each machine
 For example, give every node a copy of a large input dataset
 Reduce communication cost
 Spark actions are executed through a set of stages, separated by distributed “shuffle”
operations.
 Spark automatically broadcasts the common data needed by tasks within each stage.
>>> broadcastVar = sc.broadcast([1, 2, 3])
<pyspark.broadcast.Broadcast object at 0x102789f10>
>>> broadcastVar.value
[1, 2, 3]
Broadcast Variables…
 After the broadcast variable is created, it should be used instead of the value v in any
functions run on the cluster so that v is not shipped to the nodes more than once.
 Broadcast object v should not be modified after it is broadcast in order to ensure that
all nodes get the same value of the broadcast variable (eg. if the variable is shipped to a
new node later).

>>> bvar = sc.broadcast([1, 2, 3])


<pyspark.broadcast.Broadcast object at 0x102789f10>
>>> bvar.value
[1, 2, 3]
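A small usage sketch: once a broadcast variable is created, reference its .value inside the functions shipped to the executors instead of the raw collection, so the data is sent to each node only once.

lookup = sc.broadcast(set([1, 2, 3]))            # read-only copy cached on each machine
rdd = sc.parallelize([1, 2, 3, 4, 5, 6])
kept = rdd.filter(lambda x: x in lookup.value)   # use .value inside the task
print(kept.collect())                            # [1, 2, 3]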
Accumulators
 Accumulators are variables that are only “added” to through an associative and
commutative operation.
 They can be used to implement counters or sums.
 Spark natively supports accumulators of numeric types, and programmers can add
support for new types.
 Users can create named or unnamed accumulators. A named accumulator (in this instance,
counter) will display in the web UI for the stage that modifies that accumulator, and Spark
displays the value for each accumulator modified by a task in the "Tasks" table.
 NOTE: Named accumulators in the web UI are not yet supported in Python. A minimal example follows.
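A minimal accumulator sketch in PySpark: tasks only add to it, and the driver reads the final value.

accum = sc.accumulator(0)                                     # numeric accumulator starting at 0
sc.parallelize([1, 2, 3, 4]).foreach(lambda x: accum.add(x))  # each task adds its elements
print(accum.value)                                            # 10, read back on the driver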
RDD Execution Model

A job is divided into stages, and each stage into tasks; an RDD is divided into partitions, and each task processes one partition, running in an executor on the cluster.

Credits: Spark devoxx2014, Andre Petrella
Spark Execution Engine
Case Study 1: Processing Network Log Files
Example 1: Processing a Log File

logLinesRDD (the input/base RDD) holds log records such as:
(Error, ts, msg1), (Warn, ts, msg2), (Error, ts, msg1), (Info, ts, msg8), (Warn, ts, msg2),
(Info, ts, msg8), (Error, ts, msg3), (Info, ts, msg5), (Info, ts, msg5), (Error, ts, msg4),
(Warn, ts, msg9), (Error, ts, msg1)

Transformation – filter the errors  filter(f(x)):
errorsRDD keeps only the Error records:
(Error, ts, msg1), (Error, ts, msg1), (Error, ts, msg3), (Error, ts, msg4), (Error, ts, msg1)

Transformation  .coalesce(2):
cleanedRDD holds the same error records compacted into 2 partitions.

Action  .collect( ):
The driver executes the DAG – logLinesRDD  filter(f(x))  errorsRDD  coalesce(2) 
cleanedRDD  collect( ) – and the error records are returned to the driver.
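A hedged PySpark sketch of this pipeline; the log path and the "Error" prefix test are assumptions for illustration:

logLinesRDD = sc.textFile("/home/bda/network.log")                        # hypothetical input path
errorsRDD   = logLinesRDD.filter(lambda line: line.startswith("Error"))   # keep only Error records
cleanedRDD  = errorsRDD.coalesce(2)                                       # shrink to 2 partitions
results     = cleanedRDD.collect()                                        # DAG executes; rows return to the driver
print(results)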
Resource Managers
Resource Managers

- Local

- Standalone

- YARN

- Mesos
Local
 In local mode the driver and the executor share a single JVM on one worker machine,
with tasks running on internal threads against the local RDD partitions.
 CPU options for the master URL: local, local[N], local[*]

> ./bin/spark-shell --master local[12]

> ./bin/spark-submit --name "MyFirstApp" --master local[12] myApp.jar

val conf = new SparkConf()
  .setMaster("local[12]")
  .setAppName("MyFirstApp")
  .set("spark.executor.memory", "3g")
val sc = new SparkContext(conf)
Standalone
 A Spark Master process plus several Worker (W) processes; each worker launches an
executor (Ex) that runs tasks (T) on its RDD partitions using internal threads, and the
driver talks to the Spark Master.
 Per-machine settings go in spark-env.sh, e.g. SPARK_WORKER_CORES (cores offered by each
worker) and SPARK_LOCAL_DIRS (local scratch directories, e.g. SSDs vs. the OS disk).

> ./bin/spark-submit --name "SecondApp" --master spark://host1:port1 myApp.jar
Spark on YARN
 The client submits the application to the YARN ResourceManager; YARN starts an
Application Master, and the executors run in containers on the NodeManagers.

./bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 4g \
  --executor-memory 2g \
  --executor-cores 1 \
  --queue thequeue \
  app.py

./bin/spark-submit --master yarn
Thank You
