
“Analytics using Apache Spark”

(Lightning-Fast Cluster Computing)


Outline

Introduction to Apache Spark & Ecosystem


How to use Spark?
Resilient Distributed Datasets (RDDs)
Spark Resource Managers
Memory and Persistence
Spark Streaming
Machine Learning Library (MLlib)
Use Cases/Case Studies
How familiar are you with Spark?
Introduction to Spark
What is Apache Spark?

“Apache Spark™ is a fast and general engine for large-scale data processing”

The engine handles scheduling, monitoring, and distribution of work across the cluster.

Latest version – v2.4.3, Released on May 08, 2019


Spark Motivation

Disk I/O is very slow: a hard disk reads at roughly 100 MB/s, while RAM reads at roughly 10,000 MB/s.
The cost of memory has come down.
Opportunity: we can have more memory, so keep more data "in memory".

Ref: http://www.storagereview.com/images/ramdiskarticle1.jpg
Spark sets a new record in Petabyte Sort

Ref: https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
Other Benchmarks

Ref: http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7847709
Evolution

General batch processing: MapReduce (2004 – 2013)
Specialized systems (2007 – 2015?): Pregel, Giraph, Tez, Drill, Storm, S4, Mahout, Impala, GraphLab (iterative, interactive, ML, streaming, graph, SQL, etc.)
General unified engine: Spark (2014 – ?)
Evolution…

2002: MapReduce at Google
2004: MapReduce paper
2006: Hadoop at Yahoo
2008: Hadoop Summit
2010: Spark paper
2014: Apache Spark becomes a top-level Apache project
Apache Spark…
 Developed in 2009 at UC Berkeley AMPLab
 Open sourced in 2010
 Spark has one of the largest open source communities in Big Data
http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf

spark.apache.org
Spark Components
 Spark Core
o Provides distributed task dispatching,
scheduling, and basic I/O functionalities.
 Spark SQL
o Lets you query structured data as a distributed
dataset (RDD) in Spark
 Spark Streaming
o Allows scalable stream processing on micro-
batches of data
 MLlib
o Scalable machine learning library
 GraphX
o Allows custom iterative graph algorithms using Pregel API
Spark Core

 Spark Core
o Provides distributed task dispatching,
scheduling, and basic I/O functionalities.

o Handles scheduling, monitoring, and distribution of work


SparkSQL

 Spark SQL
o Lets you query structured data inside Spark
programs, using either SQL or a familiar
DataFrame API. Usable in Java, Scala, Python
and R.
o Runs SQL / HiveQL queries

 Features:
o Integrated: Mix Spark programs with RDDs (Python, Scala and Java) and Spark SQL queries
o Unified Data Access: Query from different data sources such as Hive tables, JSON.
o Hive Compatibility: Runs unmodified Hive queries on existing warehouses.
o DB Connectivity: JDBC and ODBC
o Performance and Scalability: scales to thousands of nodes
GraphX

 GraphX
o Spark API for graph-parallel computation.
o Introduces the Resilient Distributed Property
Graph: a directed multigraph with properties
attached to each vertex and edge.
o GraphX includes a growing collection of graph
algorithms and builders to simplify graph analytics
tasks.

Example: Find the oldest follower of each user in a network

val oldestFollowerAge = graph.mrTriplets(
  e => (e.dst.id, e.src.age),   // Map
  (a, b) => max(a, b)           // Reduce
).vertices
Spark’s Languages

Scala

Python

Java

R
Spark Packages – Total 96 (as of 27-07-2015)
Spark Packages – Total 190 (as of 02-01-2016)
Spark Packages – Total 378 (as of 26-10-2017)
Spark Packages – Total 451 (as of 17-06-2019)
Hadoop or Spark?
Hadoop MapReduce

2004 – First MapReduce paper: "MapReduce: Simplified Data Processing on Large Clusters"
(Jeffrey Dean and Sanjay Ghemawat, OSDI 2004, Google, Inc.)

MapReduce is a high-level programming model and implementation for "large-scale parallel data processing".

Image Source: http://tm.durusau.net/?cat=84
Spark .vs. Hadoop
 Spark is a better candidate for iterative jobs.
 In Hadoop MapReduce, each stage passes through the hard drives; iterative jobs involve
a lot of disk I/O for each repetition, which is very slow.
 Apache Spark uses memory instead of disk, making it 10 to 100x faster.
Spark .vs. Hadoop MR – Programming Complexity

Word Count using MR:
The Hadoop MapReduce version (package org.apache.hadoop.examples, class WordCount) needs a
TokenizerMapper, an IntSumReducer, and a driver main() that configures and submits the Job –
dozens of lines of Java boilerplate.

Word Count using Spark:

from pyspark import SparkContext

logFile = "hdfs://localhost:9000/user/bigdatavm/input"

sc = SparkContext("spark://bigdata-vm:7077", "WordCount")

textFile = sc.textFile(logFile)

wordCounts = textFile.flatMap(lambda line: line.split()) \
                     .map(lambda word: (word, 1)) \
                     .reduceByKey(lambda a, b: a + b)
Spark .vs. Hadoop MR…
MapReduce Limitations:

 Inefficient if you are repeating similar searches again and again.
 Reduce operations do not take place until all the Maps are complete.
 Designing and programming using MR is difficult.
 Performance bottlenecks.
 Batch-oriented: cannot accommodate real-time use cases.
"MR doesn't compose well for large applications"

Therefore, specialized systems were built as workarounds…
Spark .vs. Hadoop MR…
Spark's Advantages:

 Generalized patterns: a unified engine for many use cases
 Lazy evaluation of the lineage graph: reduces wait states, better pipelining
 In-memory processing: uses large memory spaces
 Functional programming / ease of use: reduction in cost to maintain large apps
 Lower overheads: starts a job quickly
 Less expensive shuffles
Is Spark replacing Hadoop ?
Enemies OR Frenemies ??
Can work together even though there are fundamental differences.

Image Credits: www.cartoonbucket.com


Migrating from Hadoop to Spark - Tools

Application Type | Hadoop or Other | Spark
Distributed Storage | HDFS | No DFS (it has to use a third-party file system)
Batch Processing | Hadoop MR | Spark RDDs & DAGs
Querying/ETL | Hive, Pig, HBase | Spark SQL (supports NoSQL data stores), Spork (Pig on Spark)
Stream Processing / Real-time Processing | Hadoop Streaming, Storm | Spark Streaming
Machine Learning | Mahout (almost migrated to Spark) | MLlib or Mahout
Interactive Exploration | -NA- | Interactive shells for Scala and Python
Hadoop .vs. Spark
 Infrastructure Type:
 Hadoop is essentially a distributed data infrastructure: it distributes massive data
collections across multiple nodes within a cluster of commodity servers,
 whereas Spark is a data-processing framework.
 Performance:
 Spark performs better when all the data fits in memory, especially on
dedicated clusters;
 Hadoop MapReduce is designed for data that doesn't fit in memory, and it can
run well alongside other services.
 Compatibility with Data Sources/Types:
 Spark's compatibility with data types and data sources is the same as Hadoop
MapReduce.
Hadoop .vs. Spark
 Batch processing with huge data:
 Hadoop MapReduce is great for batch processing if the data size is huge,
 whereas Spark needs more resources to compete with Hadoop.
 Fault Tolerance:
 Both have good fault tolerance, but
 Hadoop MapReduce is slightly more tolerant.
 However, Hadoop's fault tolerance comes at the cost of computational performance.
 In Spark, we have to re-compute using lineage information.
 Security:
 Spark security is still in its infancy;
 Hadoop MapReduce has more security features and projects.
Do I need to learn Hadoop first to learn Apache Spark?

No, you don't need to learn Hadoop to learn Spark.


Spark is an independent framework.

If you want to use HDFS or YARN with Spark, better start with Hadoop.
Installing Apache Spark
Installing Spark

1. Using pre-built binaries (along with Hadoop)
2. Using source
3. Using PyPI
PySpark is now available on PyPI. To install, just run:
$ pip install pyspark
How to use Spark?
Starting a Cluster
Master: You can start a standalone master server by executing:
$ ./sbin/start-master.sh
• Once started, the master will print out a spark://HOST:PORT URL for itself
• Master’s web UI: http://localhost:8080 (by default)
Slave(s): Start one or more workers and connect them to the master via:
• Starts a slave on the machine this script is executed on.
$ ./sbin/start-slave.sh <master-spark-URL>
• Starts a slave instance on each machine specified in the conf/slaves file.
$ ./sbin/start-slaves.sh <master-spark-URL>
Slave’s web UI: http://localhost:8081 (by default)
Configuration
Options

The following configuration options can be passed to the master and worker, for example
-h/--host, -p/--port, --webui-port and, for workers only, -c/--cores, -m/--memory and
-d/--work-dir (see the Spark standalone documentation for the full list).
Cluster Launch Scripts
 To launch a Spark standalone cluster with the launch scripts, you should create a file called
conf/slaves in your Spark directory, which must contain the hostnames of all the machines
where you intend to start Spark workers, one per line.
 If conf/slaves does not exist, the launch scripts default to a single machine (localhost), which
is useful for testing.
 Note, the master machine accesses each of the worker machines via ssh.
 By default, ssh is run in parallel and requires password-less (using a private key) access to be
setup.
 If you do not have a password-less setup, you can set the environment variable
SPARK_SSH_FOREGROUND and serially provide a password for each worker.
Cluster Launch Scripts
 sbin/start-master.sh - Starts a master instance on the machine the script is executed on.
 sbin/start-slaves.sh - Starts a slave instance on each machine specified in the conf/slaves file.
 sbin/start-slave.sh - Starts a slave instance on the machine the script is executed on.
 sbin/start-all.sh - Starts both a master and a number of slaves as described above.
 sbin/stop-master.sh - Stops the master that was started via the sbin/start-master.sh script.
 sbin/stop-slaves.sh - Stops all slave instances on the machines specified in the conf/slaves file.
 sbin/stop-all.sh - Stops both the master and the slaves as described above.

Note that these scripts must be executed on the machine you want to run the Spark master on, not your local machine.
Interactive Shells
Python:
$ pyspark --master local[4]
$ pyspark --master local[*] (uses as many threads as the number of cores)
$ pyspark --master local[4] code.py
$ pyspark --master spark://IPAddressOfMaster:7077 code.py
Scala:
$ spark-shell (by default uses only one thread, for both the driver and the program)
$ spark-shell --master local[4] --packages "org.example:example:0.1"
$ spark-shell --master local[*] --jars code.jar
Spark Master
The master parameter for Spark determines which type and size of cluster to use.

Master URL | Meaning
local | Run Spark locally on one worker thread
local[K, F] | Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine) and F max failures
local[*] | Run Spark locally with as many worker threads as logical cores on your machine
spark://HOST:PORT | Connect to the given Spark standalone cluster master. The port must be whichever one your master is configured to use, which is 7077 by default
spark://HOST1:PORT1,HOST2:PORT2,… | Connect to the given Spark standalone cluster with standby masters with ZooKeeper. The list must have all the master hosts in the high-availability cluster set up with ZooKeeper. The port must be whichever each master is configured to use, which is 7077 by default
yarn-cluster | Connect to a YARN cluster. The cluster location will be found based on the HADOOP_CONF_DIR or YARN_CONF_DIR variable
Launching Applications

# Run application locally on 8 cores
./bin/spark-submit \
--master local[8] \
Python_code.py \
100 # Arguments to the Python program
Launching Applications…
# Run on a Spark standalone cluster in client deploy mode

./bin/spark-submit \
--master spark://207.184.161.138:7077 \
--executor-memory 20G \
--total-executor-cores 100 \
/path/to/examples.jar \
1000

--executor-memory  Amount of memory to use per executor process

--total-executor-cores  Total number of cores to use across all executors (Spark standalone and Mesos only)


Launching Applications…
Other Examples

# Run application locally on 8 cores


$ spark-submit --class org.apache.spark.examples.SparkPi \
--master local[8] \
/path/to/examples.jar 100

# Run a Python application on a Spark Standalone cluster


$ spark-submit --master spark://207.184.161.138:7077 \
examples/src/main/python/pi.py 100
Installing Jupyter
1. Install Anaconda – Jupyter is part of it.
2. Otherwise, install using PyPI:
$ pip install jupyter
Connecting Spark with Jupyter
Method 1 - Configure PySpark driver
a) Add the below lines to your ~/.bashrc file.
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
b) Source your bashrc file
$ source ~/.bashrc
c) Restart your terminal and launch PySpark again:
$ pyspark --master local[4] test.py
Connecting Spark with Jupyter
Method 2 – Using findspark package
 It is a more generalized way to use PySpark in a Jupyter Notebook: use the
"findspark" package to make a SparkContext available in your code.
 The findspark package is not specific to Jupyter Notebook; you can use this trick
in your favorite IDE too.
a) To install findspark:
$ pip install findspark
b) Launch a regular Jupyter Notebook:
$ jupyter-notebook
c) In your Notebook add the following lines:
import findspark
findspark.init()
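A minimal sketch of what the notebook cell could look like after findspark.init(); the local master and the app name "test" are assumptions for illustration:

import findspark
findspark.init()

import pyspark
sc = pyspark.SparkContext(master="local[4]", appName="test")  # create a SparkContext in the notebook

print(sc.parallelize(range(10)).sum())  # quick sanity check: should print 45
sc.stop()                               # release the context when done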
Spark Execution Model

Master node, Spark driver & workers (with YARN) – see the architecture diagrams at the reference below.

Credits:
http://0x0fff.com/spark-architecture/
Spark Program Execution
A Spark program has two components:
 Driver component
 Worker component

The driver component runs on:
 the master node (or) local threads

The worker program runs on:
 worker nodes (or) local threads
Spark Program Execution

1. Spark Context: initialize the driver and submit the app to the cluster manager.
2. Build a DAG: the driver builds a DAG of data and operations.
3. DAG Scheduling: divide the DAG into stages and then into tasks.
4. Task Scheduling: schedule the tasks to executors (tells where to run).
5. Finish the Job: the results are sent to the driver; the job finishes.
Initializing Spark - Spark Context
 A Spark program first creates a SparkContext object
 Tells Spark how and where to access a cluster
 The pySpark shell automatically creates the "sc" variable

Scala:
val conf = new SparkConf().setAppName(appName).setMaster(master)
new SparkContext(conf)

Java:
SparkConf conf = new SparkConf().setAppName(app).setMaster(master);
JavaSparkContext sc = new JavaSparkContext(conf);

Python:
conf = SparkConf().setAppName(appName).setMaster(master)
sc = SparkContext(conf=conf)
Spark jargons
Application User program built on Spark. Consists of a driver program and executors on the cluster.
Application jar A jar containing the user's Spark application. In some cases users will want to create an
"uber jar" containing their application along with its dependencies. The user's jar should
never include Hadoop or Spark libraries, however, these will be added at runtime.
Driver program The process running the main part of the application and creating the SparkContext
Cluster manager Acquiring & scheduling resources on the cluster (e.g. standalone manager, Mesos, YARN)
Worker node Any node that can run application code in the cluster, Which have the executors
Executor A process launched for an application on a worker node, that runs tasks and keeps data in
memory or disk storage across them. Each application has its own executors.
Job A parallel computation consisting of multiple tasks that gets spawned in response to a Spark
action (e.g. save, collect); you'll see this term used in the driver's logs.
Stage Each job gets divided into smaller sets of tasks called stages that depend on each other
(similar to the map and reduce stages in MapReduce); You can see this in driver log.
Task A unit of work that will be sent to one executor

DAG DAG stands for Directed Acyclic Graph, in the present context its a DAG of operators.
Resilient Distributed Datasets (RDDs)
What is RDD?
Definition:
Resilient Distributed Dataset (RDD) is
the primary data abstraction in Apache
Spark and the core of Spark.
 Spark Core Concept  Resilient
Distributed Data
 The primary abstraction in Spark
 It is a Partitioned fault-tolerant
collection of data elements.
 Can be operated on in parallel.

Image Credits: http://horicky.blogspot.in/2013/12/spark-low-latency-massively-parallel.html


What is RDD?
Properties:
 Resilient - i.e. fault-tolerant with the help of RDD lineage graph and so able to
recompute missing or damaged partitions due to node failures.
 Distributed - with data residing on multiple nodes in a cluster, Enable operations on
collection of elements in parallel
 Dataset - is a collection of partitioned data with primitive values or values of values,
e.g. tuples or other objects

Note: RDD are Immutable - Cannot be changed once it is created


How to construct RDDs?
RDDs can be constructed:
1. By parallelizing existing Python collections (parallelized collections)
2. By transforming an existing RDD (PipelinedRDD)
3. From external data (a dataset in an external storage system, such as a shared
filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat)
RDDs from Parallelized Collections
 Parallelized collections are created by calling SparkContext’s “parallelize” method on
an existing iterable or collection in your driver program.
 The elements in the collection will be converted as a distributed dataset that can be
operated on in parallel.

data = [1, 2, 3, 4, 5, 6, 7]
RDD1 = sc.parallelize(data, 4)

 One important parameter for parallelized collections is the number of partitions to cut the
dataset into.
 Spark will run one task for each partition of the dataset.
 More partitions  more parallelism (a quick check is sketched below).
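A small sketch (reusing the sc and data above) of how to inspect the partitioning; getNumPartitions() and glom() are standard RDD methods, and the exact split is an implementation detail:

data = [1, 2, 3, 4, 5, 6, 7]
RDD1 = sc.parallelize(data, 4)

print(RDD1.getNumPartitions())   # 4
print(RDD1.glom().collect())     # contents per partition, e.g. [[1], [2, 3], [4, 5], [6, 7]]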
How is data partitioned?
data = [1, 2, 3, 4, 5, 6, 7]
RDD1 = sc.parallelize(data, 4)

The dataset is cut into 4 partitions, for example:
P1: 1, 2
P2: 3, 4
P3: 5, 6
P4: 7

(In the original figure, the cores allocated to the application are outlined in purple;
7 cores are allocated across three workers: 2 + 3 + 2.)
How is the data set partitioned?

data = [1, 2, 3, 4, 5, 6, 7]
RDD1 = sc.parallelize(data, 4)

4 partitions  4 tasks  4 cores used out of the 7 allocated
How is the data set transformed and collected?

# Definition of the sub function
def sub(value):
    return (value - 1)

# Use map for subtraction
subRDD1 = RDD1.map(sub)

# Apply the collect action on subRDD1 to collect the numbers
print(subRDD1.collect())

Each partition is transformed in place: [1, 2]  [0, 1], [3, 4]  [2, 3], [5, 6]  [4, 5], [7]  [6];
collect() gathers the results [0, 1, 2, 3, 4, 5, 6] back at the driver.
RDDs from External Datasets
 PySpark can create distributed datasets from any storage source supported by Hadoop:
 local file system,
 HDFS,
 Cassandra,
 HBase,
 Amazon S3, etc.
 Spark supports text files, SequenceFiles, and any other Hadoop InputFormat.
 Text file RDDs can be created using SparkContext’s textFile method.

distFile = sc.textFile("/home/bda/data.txt", minPartitions=4, use_unicode=True)
External Datasets…
NOTE:
 If using a path on the local file system, the file must also be accessible at the same path
on worker nodes. Either copy the file to all workers or use a network-mounted shared
file system.
 All of Spark’s file-based input methods, including textFile, support running on
directories, compressed files, and wildcards as well.
 Number of partitions of a file:
 Mention the number of partitions as the 2nd argument of the textFile method.
 By default, Spark creates one partition for each block of the file (blocks being
128 MB by default in HDFS),
 but you can also ask for a higher number of partitions by passing a larger value.
 Note that you cannot have fewer partitions than blocks (a short example follows).
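A short sketch of requesting more partitions when reading a file (the path is the same hypothetical one used above):

distFile = sc.textFile("/home/bda/data.txt", 8)   # ask for at least 8 partitions
print(distFile.getNumPartitions())                # at least the number of HDFS blocks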
Reading and Saving RDDs
Spark Python API supports additional formats:
Reading Directories
 SparkContext.wholeTextFiles lets you read a directory containing multiple small
text files, and returns each of them as (filename, content) pairs. This is in contrast
with textFile, which would return one record per line in each file.
Saving RDDs
 RDD.saveAsTextFile("PATH") – a round-trip sketch is given below
 RDD.saveAsPickleFile and SparkContext.pickleFile support saving an RDD in a
simple format consisting of pickled Python objects. Batching is used on pickle
serialization, with default batch size 10.
Others
 SequenceFile and Hadoop Input/Output Formats
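A minimal round-trip sketch for these saving/reading APIs; the /tmp output paths are hypothetical and must not already exist:

rdd = sc.parallelize([("f1", "hello"), ("f2", "world")])

rdd.saveAsTextFile("/tmp/demo_text")          # one part-* file per partition
rdd.saveAsPickleFile("/tmp/demo_pickle", 10)  # pickled Python objects, batch size 10

back = sc.pickleFile("/tmp/demo_pickle")
print(back.collect())

pairs = sc.wholeTextFiles("/tmp/demo_text")   # (filename, content) pairs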
PySpark's Writable support for RDDs
Spark Python API supports additional formats:
Writable Support:
 PySpark SequenceFile support loads an RDD of key-value pairs within Java,
 converts Writables to base Java types, and pickles the resulting Java objects
using Pyrolite.
 When saving an RDD of key-value pairs to a SequenceFile, PySpark does the reverse:
 it unpickles Python objects into Java objects and then converts them to Writables.
 The following Writables are automatically converted:
Examples: Writable type "Text" is converted into the Python type "unicode str";
Writable type "IntWritable" is converted into the Python type "int".
RDD Life Cycle
1. Create an input RDD from parallelizing a collection OR from an external data source
2. Perform the required transformations on the base RDD to create new RDDs
3. Instruct Spark to Store/Persist them in Disk/Cache. Persist using cache if it can be
reused later
4. Perform one or more actions on them by parallel computations using spark

Image Credits: Spark devoxx2014, Andre Petrella


Transformations
Transformations
 Creates new datasets (RDDs) from the existing ones
 In other words, transformations are functions that take a RDD as the input and produce
one or many RDDs as the output.
 Input RDD will not be changed (since RDDs are immutable)
 Lazy Evaluation:
 They do not compute their results right away. Instead, they just remember the
transformations applied (Lineage) to some base dataset (e.g. a file).
 Spark optimizes the required calculations
 Computes when there is an “action”
 Transformations can be treated as recipes for creating the result of an action (see the sketch below).
 Lineage helps to recover from failures and slow workers.
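A small sketch of lazy evaluation in PySpark (the file path is hypothetical): the transformations return immediately and only record lineage; the computation happens when the action runs.

lines = sc.textFile("/home/bda/data.txt")       # nothing is read yet
errors = lines.filter(lambda l: "Error" in l)   # still nothing computed, lineage recorded
print(errors.toDebugString())                   # inspect the lineage
print(errors.count())                           # action: triggers the actual computation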
Why Transformations?
1. To understand the data ( List out the number of columns in data and their type)
2. Preprocess the data (Remove null values, Remove outliers).
3. Filter the data
4. Fill the null values or missing values in data ( Filling the null values in data by constant,
mean, median, etc)
5. Deriving new features from data.
And so on….
Transformations

 Spark stores the lineage back to the base RDD.
 By applying transformations you incrementally build an RDD lineage with all the
parent RDDs of the final RDD(s).
Transformations…

Manipulations (on a single RDD): map, flatMap, filter, distinct
Across RDDs (on multiple RDDs): union, subtract, intersection, join, cartesian
Reorganization (on a single key-value RDD): groupByKey, reduceByKey, sortByKey
Tuning (on a single RDD): coalesce, repartition
Examples – Transformations - map
# Create an RDD
data = [1, 2, 3, 4, 5, 6, 7]
RDD1 = sc.parallelize(data, 4)

# Create the sub function to subtract 1
# Args: one integer value; Returns: one integer value
def sub(value):
    return (value - 1)

# Transform RDD1 through the map transformation using the sub function
subRDD = RDD1.map(sub)

# Print the lineage
print(subRDD.toDebugString())
More Examples – Transformations – filter & distinct

# Create an RDD
data = [1, 2, 2, 3, 4, 4, 5, 6, 7]
RDD1 = sc.parallelize(data, 4)

# Create an RDD of only even numbers (Filter odd numbers)


RDD2 = RDD1.filter( lambda x: x%2 == 0 )

# Get the distinct even numbers


RDD3 = RDD2.distinct()
More Examples – Transformations – map & flatMap

# Create an RDD
data = [1, 2, 3]
RDD1 = sc.parallelize(data, 3)

# Create an RDD of [x, x+5] pairs using map
RDD2 = RDD1.map( lambda x: [ x, x+5 ] )
Output: [1, 2, 3]  [ [1,6], [2,7], [3,8] ]

# Create a flattened RDD of x and x+5 using flatMap
RDD3 = RDD1.flatMap( lambda x: [ x, x+5 ] )
Output: [1, 2, 3]  [ 1, 6, 2, 7, 3, 8 ]
Pair RDDs
 A pair RDD is an RDD where each element is a pair tuple (key, value).
 For example, sc.parallelize([('a', 1), ('a', 2), ('b', 1)]) would create a pair RDD where the
keys are 'a', 'a', 'b' and the values are 1, 2, 1.
 Transformations:
 groupByKey()
 reduceByKey()

pairRDD = sc.parallelize([('a', 1), ('a', 2), ('b', 1)])

# mapValues is only used to improve the format for printing
print(pairRDD.groupByKey().mapValues(lambda x: list(x)).collect())
Output: [('a', [1, 2]), ('b', [1])]
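Since reduceByKey is also listed above, a matching sketch on the same pairRDD:

# Sum the values for each key; reduceByKey combines values per key on each partition first
print(pairRDD.reduceByKey(lambda a, b: a + b).collect())
Output (order may vary): [('a', 3), ('b', 1)]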
Actions
Actions
 Return a value to the driver program after running a computation on the dataset
 Mechanism to get the results from the workers to the driver
 Cause Spark to execute the recipe to transform the source

Data fetching: collect, take(n), first, takeSample
Aggregation: reduce, count, countByKey
Output: foreach, foreachPartition, saveAsTextFile, saveAsSequenceFile, saveAsObjectFile, saveToCassandra
Actions – Examples – count, collect

# Create an RDD
data = [1, 2, 2, 3, 4, 4, 5, 6, 7]
RDD1 = sc.parallelize(data, 4)

# Count the number of elements in RDD1 and print.


print(RDD1.count( ))

# Collect the data of RDD1 and print


print(RDD1.collect())
Additional Actions – Examples
# Get the first element
print(filteredRDD.first())

# The first 4
print(filteredRDD.take(4))

# Create a new base RDD to show countByValue
repetitiveRDD = sc.parallelize([1, 2, 3, 1, 2, 3, 1, 2, 1, 2, 3, 3, 3, 4, 5, 4, 6])
print(repetitiveRDD.countByValue())
Output: defaultdict(<type 'int'>, {1: 4, 2: 4, 3: 5, 4: 2, 5: 1, 6: 1})
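Two more of the actions listed earlier, sketched on the same repetitiveRDD:

# Aggregate all elements with reduce
print(repetitiveRDD.reduce(lambda a, b: a + b))   # 46, the sum of all elements

# Take a random sample of 3 elements without replacement
print(repetitiveRDD.takeSample(False, 3))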
Types of RDDs

• FilteredRDD • MappedRDD • PairRDD • ShuffledRDD • UnionRDD • PythonRDD • HadoopRDD
• DoubleRDD • JdbcRDD • JsonRDD • SchemaRDD • VertexRDD • EdgeRDD
• CassandraRDD (DataStax) • GeoRDD (ESRI)
Shared Variables
 When a function passed to a Spark operation (such as map or reduce) is executed on a
remote cluster node, it works on separate copies of all the variables used in the function.
 These variables are copied to each machine, and no updates to the variables on the
remote machine are propagated back to the driver program.
 Supporting general, read-write shared variables across tasks would be inefficient.
 However, Spark does provide two limited types of shared variables for two common
usage patterns:
 broadcast variables
 accumulators.
Broadcast Variables
 Keep a read-only variable cached on each machine
 For example, give every node a copy of a large input dataset
 Reduce communication cost
 Spark actions are executed through a set of stages, separated by distributed “shuffle”
operations.
 Spark automatically broadcasts the common data needed by tasks within each stage.
>>> broadcastVar = sc.broadcast([1, 2, 3])
<pyspark.broadcast.Broadcast object at 0x102789f10>
>>> broadcastVar.value
[1, 2, 3]
Broadcast Variables…
 After the broadcast variable is created, it should be used instead of the value v in any
functions run on the cluster so that v is not shipped to the nodes more than once.
 Broadcast object v should not be modified after it is broadcast in order to ensure that
all nodes get the same value of the broadcast variable (eg. if the variable is shipped to a
new node later).

>>> bvar = sc.broadcast([1, 2, 3])


<pyspark.broadcast.Broadcast object at 0x102789f10>
>>> bvar.value
[1, 2, 3]
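A small usage sketch: once a broadcast variable is created, reference its .value inside the functions shipped to the executors instead of the raw collection, so the data is sent to each node only once.

lookup = sc.broadcast(set([1, 2, 3]))            # read-only copy cached on each machine
rdd = sc.parallelize([1, 2, 3, 4, 5, 6])
kept = rdd.filter(lambda x: x in lookup.value)   # use .value inside the task
print(kept.collect())                            # [1, 2, 3]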
Accumulators
 Accumulators are variables that are only “added” to through an associative and
commutative operation.
 They can be used to implement counters or sums.
 Spark natively supports accumulators of numeric types, and programmers can add
support for new types.
 Users can create named or unnamed accumulators. A named accumulator (in this instance,
counter) will display in the web UI for the stage that modifies that accumulator, and Spark
displays the value for each accumulator modified by a task in the "Tasks" table.
 NOTE: Named accumulators in the web UI are not yet supported in Python. A minimal example follows.
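A minimal accumulator sketch in PySpark: tasks only add to it, and the driver reads the final value.

accum = sc.accumulator(0)                                     # numeric accumulator starting at 0
sc.parallelize([1, 2, 3, 4]).foreach(lambda x: accum.add(x))  # each task adds its elements
print(accum.value)                                            # 10, read back on the driver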
RDD Execution Model

A job is divided into stages, and each stage into tasks; an RDD is divided into partitions, and each task processes one partition, running in an executor on the cluster.

Credits: Spark devoxx2014, Andre Petrella
Spark Execution Engine
Case Study 1: Processing Network Log Files
Example 1: Processing a Log File

logLinesRDD (the input/base RDD) holds log records such as:
(Error, ts, msg1), (Warn, ts, msg2), (Error, ts, msg1), (Info, ts, msg8), (Warn, ts, msg2),
(Info, ts, msg8), (Error, ts, msg3), (Info, ts, msg5), (Info, ts, msg5), (Error, ts, msg4),
(Warn, ts, msg9), (Error, ts, msg1)

Transformation – filter the errors  filter(f(x)):
errorsRDD keeps only the Error records:
(Error, ts, msg1), (Error, ts, msg1), (Error, ts, msg3), (Error, ts, msg4), (Error, ts, msg1)

Transformation  .coalesce(2):
cleanedRDD holds the same error records compacted into 2 partitions.

Action  .collect( ):
The driver executes the DAG – logLinesRDD  filter(f(x))  errorsRDD  coalesce(2) 
cleanedRDD  collect( ) – and the error records are returned to the driver.
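A hedged PySpark sketch of this pipeline; the log path and the "Error" prefix test are assumptions for illustration:

logLinesRDD = sc.textFile("/home/bda/network.log")                        # hypothetical input path
errorsRDD   = logLinesRDD.filter(lambda line: line.startswith("Error"))   # keep only Error records
cleanedRDD  = errorsRDD.coalesce(2)                                       # shrink to 2 partitions
results     = cleanedRDD.collect()                                        # DAG executes; rows return to the driver
print(results)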
Resource Managers
Resource Managers

- Local

- Standalone

- YARN

- Mesos
Local
 In local mode the driver and the executor share a single JVM on one worker machine,
with tasks running on internal threads against the local RDD partitions.
 CPU options for the master URL: local, local[N], local[*]

> ./bin/spark-shell --master local[12]

> ./bin/spark-submit --name "MyFirstApp" --master local[12] myApp.jar

val conf = new SparkConf()
  .setMaster("local[12]")
  .setAppName("MyFirstApp")
  .set("spark.executor.memory", "3g")
val sc = new SparkContext(conf)
Standalone
 A Spark Master process plus several Worker (W) processes; each worker launches an
executor (Ex) that runs tasks (T) on its RDD partitions using internal threads, and the
driver talks to the Spark Master.
 Per-machine settings go in spark-env.sh, e.g. SPARK_WORKER_CORES (cores offered by each
worker) and SPARK_LOCAL_DIRS (local scratch directories, e.g. SSDs vs. the OS disk).

> ./bin/spark-submit --name "SecondApp" --master spark://host1:port1 myApp.jar
Spark on YARN
 The client submits the application to the YARN ResourceManager; YARN starts an
Application Master, and the executors run in containers on the NodeManagers.

./bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 4g \
  --executor-memory 2g \
  --executor-cores 1 \
  --queue thequeue \
  app.py

./bin/spark-submit --master yarn
Thank You
