
ECS640U/ECS765P Big Data Processing

Spark Programming
Lecturer: Ahmed M. A. Sayed
School of Electronic Engineering and Computer Science
Big Data Processing: Week 5
Topic List:

● Spark Program Execution
● The Structured API
● RDDs and parallelism
● RDD operations
Spark architecture

A Spark application consists of a driver process (schedules work, maintains application information, responds to users) and a set of executor processes (run tasks and report back to the driver). A cluster manager keeps track of the cluster's resources.

Spark components run in Java Virtual Machines (JVM): Scala, which is the language Spark is written in,
compiles into Java bytecode.

[Diagram: the Driver communicates with the Cluster Manager, which allocates an Executor on each Worker node]
Execution modes

Several execution modes are distinguished depending on where Spark processes are physically located:

• Local mode: Spark processes run on a single machine, where the number of cores can be selected.
Useful for prototyping, not for production applications.
• Client mode: The driver process is on the machine that submits the application, hence outside the
cluster.
• Cluster mode: The entire application is submitted to a cluster manager. The cluster manager launches
the driver on a worker node, in addition to the executor processes. It is the most common mode.
The difference between client and cluster mode is where the driver runs: outside the cluster (client mode) or inside it (cluster mode).
Deployment options

Several deployment modes are distinguished depending on the environment where Spark runs:

• Standalone: Spark's own Master process acts as the cluster manager (no external resource manager such as YARN or Mesos is needed).
• On YARN: Spark can easily run on a Hadoop cluster, with YARN as a cluster manager. The driver process
can run inside an ApplicationMaster managed by YARN (cluster mode) or in a client, with a separate
master that requests resources from YARN (client mode). The executor is assigned to a NodeManager.
• On Mesos: The Mesos master becomes the cluster manager.
Mesos is a distributed systems kernel: the Mesos kernel runs on every machine and provides
applications (e.g. Hadoop, Spark, Kafka, Elasticsearch) with APIs for resource management and
scheduling across entire datacenter and cloud environments.
Life-cycle of a Spark application

The life-cycle of a Spark application in cluster mode consists of the following steps:

• Client request: An application is submitted (spark-submit) to the cluster manager, resources for the
Spark driver process are requested. The Spark driver process is placed onto a node and the client exits.
• Launch: The driver runs user code and initialises a Spark cluster by creating a SparkSession, which will
communicate with the cluster manager asking it to launch Spark executors. The cluster manager
responds by launching the executor processes.
• Execution: Code is run on the executor processes. The driver schedules tasks onto each worker, which
respond with the status of those tasks.
• Completion: The driver process exits with either success or failure and the cluster manager shuts down
the executors.

(Client mode is similar, the main difference being that the driver is placed outside the cluster)
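
As an illustration, a minimal sketch of the SparkSession creation that happens in the Launch step (assuming PySpark; the application name is an arbitrary placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("ExampleApp") \
    .getOrCreate()          # asks the cluster manager to launch executors
sc = spark.sparkContext     # the underlying SparkContext, used for RDD operations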
Big Data Processing: Week 5
Topic List:

● Spark Program Execution
● The Structured API
● RDDs and parallelism
● RDD operations
Structured API overview

Structured APIs are high-level APIs for working with structured data:

• Distributed and in-memory (like RDDs).
• Represent immutable, lazily evaluated plans (like RDDs).
  (https://en.wikipedia.org/wiki/Immutable_object, https://en.wikipedia.org/wiki/Lazy_evaluation)
• Table-like collections of records, consisting of rows and columns.

There are three structures:

• Datasets are collections of typed objects and provide compile-time type safety. Only in Java and Scala.
• DataFrames are treated as collections of untyped objects until run-time. Mainly for Python and R.
• SQL tables and views are DataFrames, the only difference being that SQL is executed against them.

https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
Types, schemas and partitioning

Spark has its own data types:


• Basic types include Boolean, byte, int, string, …
• Complex types include array, tuple, map, …
• Data types map directly to the different language APIs (operations in Python operate on Spark types).

In the Structured API:


• A row is a record that contains an ordered sequence of values. They can be created by using Row().
• A DataFrame is a collection of rows with a schema.
• Schemas define columns (names and types). Manually created or from source (schema-on-read).
• The partitioning scheme describes how records are distributed across a cluster.
DataFrame manual creation

StructField(String name, DataType dataType, boolean nullable)

from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, StringType, LongType

myManualSchema = StructType([
    StructField("some", StringType(), True),    # nullable=True: the value can be null/empty
    StructField("col", StringType(), True),
    StructField("names", LongType(), False)])

myRow = Row("Hello", None, 1)

myDf = spark.createDataFrame([myRow], myManualSchema)


Data sources and sinks

DataFrames can be created from raw data sources:

• Spark's core data sources: CSV, JSON, Parquet (https://parquet.apache.org), ORC, SQL databases and plain text files.
• Community data sources (NoSQL databases and warehouses): MongoDB (https://www.mongodb.com), Cassandra (https://cassandra.apache.org/), HBase (https://hbase.apache.org), Amazon Redshift (https://aws.amazon.com/redshift/), …

Data can be read through the SparkSession via the read attribute:

spark.read.format(...).option(...).schema(...).load()

Data of DataFrame can be written via the write attribute:

df.write.format(...).option(...).save()

https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html
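
For example, a minimal sketch of reading a CSV file into a DataFrame and writing it back out as Parquet (the paths and options are illustrative):

df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("/data/input.csv")        # hypothetical input path

df.write.format("parquet") \
    .mode("overwrite") \
    .save("/data/output.parquet")   # hypothetical output path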
Basic operations

Spark defines common operations with tabular data (these are transformations):

• The col() method is used for referring to columns
• Columns can be renamed, added and dropped using withColumnRenamed(), withColumn() and drop()
• Change the type of a column: cast()
• Get unique rows: distinct()
• Get the combination of rows in two DataFrames (can have duplicates): union()

Rows can be collected to the driver using the following actions:

• collect() gets the entire DataFrame, take() gets a number of rows and show() prints out a number of rows

https://sparkbyexamples.com/pyspark-tutorial/
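
A short sketch of these DataFrame operations applied to the df read above (the column names "price", "name" and "comments" are assumed for illustration):

from pyspark.sql.functions import col

df2 = (df.withColumn("price_int", col("price").cast("int"))
         .withColumnRenamed("name", "product_name")
         .drop("comments"))
df2.distinct().show(5)   # print 5 unique rows
rows = df2.take(3)       # bring 3 rows back to the driver as a list of Row objects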
Data analysis operations

Spark defines computations using common patterns found in data analysis:

• Projections and filters: select(), filter() and where()
• Column summaries: count(), countDistinct(), min(), max(), sum(), avg(), variance()
• Grouping by column values using groupBy() or on windows (frames of rows) using over()
• Sorting rows: orderBy()
• Bringing together tables or datasets by joining: join()
• Creating random splits (for machine learning algorithms): randomSplit()

https://sparkbyexamples.com/pyspark-tutorial/
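
A minimal sketch combining some of these patterns (the "sales" DataFrame and its columns are assumed for illustration):

from pyspark.sql.functions import col, avg, sum as sum_

summary = (sales.filter(col("year") == 2023)
                .groupBy("region")
                .agg(sum_("amount").alias("total"), avg("amount").alias("average"))
                .orderBy(col("total"), ascending=False))
summary.show()

train, test = sales.randomSplit([0.8, 0.2], seed=42)   # random 80/20 split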
User-Defined Functions (UDF)

Custom transformations can be written as User-Defined Functions (UDF) in different languages:


• UDFs can take as input and return as output one or more columns
• Spark serialises UDFs on the driver and transfers them to the executors
• If the UDF is written in Scala or Java, it can be used within the JVM with little performance penalty.
• If it is written in Python, a Python process starts on the worker, data is serialised to a Python format,
data is processed in the Python process, results are returned to the JVM. Data serialisation is
expensive and Spark is unable to manage the resources assigned to the Python process.
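
A hedged sketch of defining and applying a Python UDF (the function, column name and truncation length are illustrative):

from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

def shorten(text):
    # runs inside the worker's Python process; data is serialised to and from the JVM
    return text[:10] if text is not None else None

shorten_udf = udf(shorten, StringType())
df3 = df.withColumn("short_desc", shorten_udf(col("description")))   # "description" is a hypothetical column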

[Diagram: the driver JVM coordinates executor JVMs; when Python UDFs are used, each worker also runs a separate Python process alongside its JVM]


Structured API execution

Structured API queries are executed as follows:

• DataFrame/Dataset/SQL code is written


• Analysis stage: If valid, Spark converts this code to a logical plan
• Logical optimisation: The catalyst optimiser produces an optimised logical plan.
• Physical planning: The logical plan is transformed using optimisations to a physical plan (RDD)
• Code generation: Java bytecode to run on each machine is generated
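
The generated plans can be inspected with explain(); for instance, on the DataFrame created earlier:

myDf.filter(myDf["some"] == "Hello").explain(True)   # prints the parsed, analysed and optimised logical plans plus the physical plan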
Structured API execution: Logical plan

The logical plan represents a set of abstract transformations that do not refer to executors or drivers:
• Valid code results in an unresolved logical plan: the code is syntactically valid, but the tables or columns it refers to might not exist yet.
• In the analyzer, table information from a repository (the catalog) is used to resolve columns and tables, producing the resolved logical plan; the catalyst optimiser then produces the optimised logical plan.

[Diagram: user code → unresolved logical plan → (analyzer, using the catalog) → resolved logical plan → (catalyst optimiser) → optimised logical plan]

https://www.databricks.com/glossary/catalyst-optimizer
Structured API execution: Physical plan

The physical plan specifies how an optimised logical plan will be executed on a cluster as operations on
RDDs, by generating different planning strategies and comparing them through a cost model.

[Diagram: the optimised logical plan is expanded into candidate physical plans 1..N, which are compared through the cost model; the best physical plan is selected for execution]
Big Data Processing: Week 5
Topic List:

● Spark Program Execution
● The Structured API
● RDDs and parallelism
● RDD operations

QUIZ and BREAK


The Resilient Distributed Dataset

An RDD
• represents an immutable, partitioned collection of records that can be operated on in parallel,
• and its records are just objects (unlike DataFrames, which have a schema).

Using the low-level API:


• + Gives you complete control over RDDs
• + Allows you to store anything in the objects

• - Manipulations and interactions need to be defined by hand (risk of reinventing the wheel)
• - Optimisations need to be implemented by hand
• - With Python, UDFs need serialisation, which affects performance, so Python is not recommended for the low-level API
Partitions

RDDs are split into partitions:


• Partitions might be located in different machines in a cluster
• Operations are executed in parallel on each partition
• Operations can also be (lazy) transformations or actions

What is a lazy transformation/action? Lazy evaluation is a key feature of Apache Spark that improves its efficiency and performance. It refers to the strategy where transformations on distributed datasets are not executed immediately; instead, their execution is delayed until an action is called.

Spark can determine the number of partitions:

• Typically 2-4 partitions for each CPU in the cluster.
• One CPU core processes one partition at a time
• One partition per file block when creating from storage sources (blocks are 128 MB by default in HDFS)
• The number of partitions can also be explicitly set when creating an RDD or through a transformation, as shown in the sketch below
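
A small sketch showing how the partition count can be set and inspected (the values are illustrative):

rdd = sc.parallelize(range(1000), 8)      # explicitly request 8 partitions
print(rdd.getNumPartitions())             # -> 8

lines = sc.textFile("/hdfspath/to/file")  # one partition per HDFS block by default
print(lines.getNumPartitions())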
Partitions

There are two methods to manage partitions:


• coalesce(): coalesces partitions into a smaller number
• repartition(): increases/decreases by randomly reshuffling the data (over the network)

Both methods control the number of partitions and are useful for running operations more efficiently, for instance for balancing data across the partitions after filtering down a large dataset.
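
A brief sketch (the partition counts are illustrative):

rdd = sc.parallelize(range(10000), 100)
fewer = rdd.coalesce(10)        # merge down to 10 partitions, avoiding a full shuffle
balanced = rdd.repartition(50)  # reshuffle the data over the network into 50 partitions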
Memory management

RDDs are not materialised in memory until an action is carried out.

• Once the results are obtained, RDDs can be discarded, although they might be temporarily held in cache
• RDDs can be recreated from their chain of transformations, but this is very costly to do

The driver process can explicitly request an RDD to be kept in memory:


• The persist() and cache() methods maintain an RDD in memory after its first computation
  persist() allows for choosing the storage level (memory or disk)
  cache() uses the default storage level only, which is memory for RDDs and memory+disk for Datasets
• Future transformations on the same RDD will run faster
• Key for iterative algorithms and fast interactive use
Example: Log mining

Load error messages from a log into memory, then interactively search for patterns

lines = sc.textFile("/data...")
errors = lines.filter(lambda l: l.startswith("ERROR"))
messages = errors.map(lambda l: l.split("\t")[2])
cachedMsgs = messages.cache()

cachedMsgs.filter(lambda l: "foo" in l).count()
cachedMsgs.filter(lambda l: "bar" in l).count()
Example: Gradient Descent Algorithm
In the following example, the cache() method avoids reloading the data in each iteration:

import math
import numpy as np

data = sc.textFile(...).map(readPoint).cache()   # cached so it is not recomputed/reloaded in each iteration

w = np.random.rand(D)   # D: number of features

for i in range(1, ITERATIONS):
    gradient = data.map(lambda p:
        (1 / (1 + math.exp(-p[1] * np.dot(w, p[0]))) - 1) * p[1] * p[0]
    ).reduce(lambda a, b: a + b)
    w -= gradient

print("Final w: " + str(w))
Performance Issues

• All the added expressivity of Spark makes the task of efficiently allocating the different RDDs
much more challenging

• Errors appear more often, and they can be hard to debug

• Knowledge of basics (e.g. Map/Reduce) greatly helps


Performance tuning

• Memory tuning
• Much more prone to OutOfMemory errors than MapReduce.
• How much memory is needed for each RDD partition?
• How many partitions make sense for each RDD?
• What are the performance implications of each operation?

• Good advice can be found in


• http://spark.apache.org/docs/latest/tuning.html
Big Data Processing: Week 5
Topic List:

● Spark Program Execution
● The Structured API
● RDDs and parallelism
● RDD operations

https://sparkbyexamples.com/pyspark-tutorial/
Creating RDDs

Any existing local collection can be converted to an RDD using parallelize(); the number of partitions can optionally be set as the second argument:

sc.parallelize([1, 2, 3], 2)   # or, equivalently:
spark.sparkContext.parallelize([1, 2, 3], 2)

RDDs can be created from data sources, such as HDFS:

sc.textFile("/hdfspath/to/file")

In this example, the created RDD is a collection of lines. Other methods for reading external sources, such as JSON and CSV, are also available.

https://spark.apache.org/docs/1.6.1/sql-programming-guide.html#data-sources
Narrow and wide operations

It is convenient to distinguish between:


• Narrow or element-wise operations, which are applied to each record independently. (record by record)
Examples include map(), flatMap() or filter().

• Wide or shuffle operations, which involve records from multiple partitions and are costly. It is of the
same nature as shuffle in MapReduce.
Examples include groupByKey(), join() and reduceByKey().
Spark – MapReduce comparison

Spark          MapReduce
map            Map
flatMap        Map
filter         Map
sample         Map
union          Map (2 inputs)
groupByKey     Shuffle
reduceByKey    Shuffle + Reduce
join           Shuffle + Reduce
…              …

Spark is more expressive, with a richer set of operations.
Narrow Transformations

map(): creates a new RDD with the same number of records, each new record is the result
of applying the transformation function to the original record:

tweet = messages.map( lambda x: x.split(",")[3] )

filter(): creates a new RDD with at most the same number of records as the original RDD.
A record is only transferred if the function returns true for that record

grave = logs.filter(lambda x: x.startswith("GRAVE"))

Both map() and filter() results have the same number of partitions as the source RDD.
Narrow transformations

flatMap(): creates a new RDD with a new collection. Each original record generates a variable
number of records when applying the transformation

words = lines.flatMap(lambda x: x.split(" "))

• All records belong to the same collection (no hierarchy)
• Same partitions as the source RDD (but potentially containing more records): since each record is just expanded in place, e.g. from lines into words, the number of partitions remains the same
• Frequently used for item segmentation/splitting
Numeric operations

When the type of the elements of an RDD is numeric (e.g. Integer or Double), Spark also provides
aggregated summarisation methods on the RDD

mean(), sum(), max(), min(), variance(), stdev()
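
For instance, a quick sketch on a numeric RDD:

nums = sc.parallelize([1.0, 2.0, 3.0, 4.0])
print(nums.sum(), nums.mean(), nums.stdev())   # 10.0 2.5 1.118...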


Set Operations

• union(): returns the elements contained in either RDD, i.e. combines the elements from both RDDs

The following methods may require a shuffle and hence are costly to compute:

• intersection(), subtract(): return the elements contained in both RDDs, or appearing in the
first RDD but not in the second, respectively

• distinct(): returns a set with the unique elements

• cartesian(): returns all possible pairs from both sets. The basis for performing joins → one of the
most expensive operations
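
A small sketch of these set operations (results are shown as comments; ordering may vary):

a = sc.parallelize([1, 2, 3, 4])
b = sc.parallelize([3, 4, 5])

a.union(b).collect()             # [1, 2, 3, 4, 3, 4, 5]
a.intersection(b).collect()      # [3, 4]
a.subtract(b).collect()          # [1, 2]
a.union(b).distinct().collect()  # unique elements from both RDDs
a.cartesian(b).count()           # 12 possible pairs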
Reduce Operations

reduce() is an action which returns to the driver one single value from the RDD

• Analogous to functional programming.


• Iteratively applies a binary function
rdd.reduce(lambda a, b: a + b)
# [1, 2, 3, 4, 5] -> ((1 + 2) + (3 + 4)) + 5 = 15

reduceByKey() is a transformation analogous to MapReduce’s Shuffle+Reduce. Reduces


values for each key, into a new record consisting of (key, result)
For each unique key, the list of values associated with it is reduced into a single value
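
A minimal sketch of reduceByKey() used for a word count (the words RDD is assumed to come from the flatMap example above):

pairs = words.map(lambda w: (w, 1))
counts = pairs.reduceByKey(lambda a, b: a + b)   # one (word, total) pair per unique word
counts.take(5)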
Grouping

• Some RDDs will be lists of key/value pairs


• Easily created with (k,v) notation and map function:

rdd.map(lambda x: (x, 1))   # maps each element x to the key/value pair (x, 1)

• Tuple keys/values are accessed with the [0]/[1] operator


• Additional transformations/actions are available for RDD consisting of tuples
Group by operations

Group by transformations are similar to the shuffling that takes place in MapReduce jobs

• RDD must be a collection of pairs of (key, value) elements


• reduceByKey(): groups together all the values belonging to the same key and computes a
reduce function on each.
The combiner is automatically invoked and is almost equivalent to Shuffle + Reduce

• groupByKey(): returns a dataset of (K, Iterable<V>) pairs. It is equivalent to MapReduce's
shuffle; a map or reduce function applied afterwards is then equivalent to MapReduce's Reduce

• combineByKey: similar to groupByKey but combines the elements for each key using a
custom set of aggregation functions. Turns an RDD[(K, V)] into a result of type RDD[(K, C)], for a
“combined type” C
reduceByKey() and aggregateByKey() internally call combineByKey()
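
A hedged sketch comparing these operations on a pair RDD (the data and the per-key average computed with combineByKey are illustrative):

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

pairs.reduceByKey(lambda a, b: a + b).collect()   # [('a', 4), ('b', 2)] (order may vary)
pairs.groupByKey().mapValues(list).collect()      # [('a', [1, 3]), ('b', [2])]

# combineByKey: per-key average via (sum, count) combiners
sum_count = pairs.combineByKey(lambda v: (v, 1),                          # createCombiner
                               lambda acc, v: (acc[0] + v, acc[1] + 1),   # mergeValue
                               lambda a, b: (a[0] + b[0], a[1] + b[1]))   # mergeCombiners
averages = sum_count.mapValues(lambda t: t[0] / t[1])
averages.collect()   # [('a', 2.0), ('b', 2.0)]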
Joins need a key

• Joins in Spark are implemented for Tuple RDDs and are performed by the tuple keys

• Involves a costly Shuffle operation

• Often requires a previous map operation to set up join keys

• join(): performs an inner join with another RDD.


• Other join types also implemented: leftOuterJoin, rightOuterJoin, fullOuterJoin

• Joins are computed much faster if both RDDs have the same number of partitions
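
A minimal sketch of a pair-RDD join (the data is illustrative):

users = sc.parallelize([(1, "alice"), (2, "bob")])
orders = sc.parallelize([(1, "book"), (1, "pen"), (3, "lamp")])

users.join(orders).collect()            # [(1, ('alice', 'book')), (1, ('alice', 'pen'))]
users.leftOuterJoin(orders).collect()   # also keeps (2, ('bob', None))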
Retrieving information

• RDDs exist in the cluster; they cannot be read directly by the driver

• Actions allow the driver program to retrieve values from the RDDs. Useful for algorithms, and
interactive applications

• Multiple actions defined for that purpose, including:


count(): returns number of elements
takeSample(): returns a sample of elements
reduce(): reduces collection to a single value
collect(): returns the whole RDD to the driver → potential OutOfMemory errors! Almost never used
Big Data Processing: Week 5
Topic List:

● Spark Program Execution
● The Structured API
● RDDs and parallelism
● RDD operations

QUIZ and END
