ECS765P_W5_Spark Programming
ECS640U/ECS765P Big Data Processing
Spark Programming
Lecturer: Ahmed M. A. Sayed
School of Electronic Engineering and Computer Science
Big Data Processing: Week 5
A Spark application consists of a driver process (schedules work, maintains application state, responds to the user) and a set
of executor processes (run tasks, report their status back to the driver). A cluster manager keeps track of the resources.
Spark components run in Java Virtual Machines (JVM): Scala, which is the language Spark is written in,
compiles into Java bytecode.
Several execution modes are distinguished depending on where Spark processes are physically located:
• Local mode: Spark processes run on a single machine, where the number of cores can be selected.
Useful for prototyping, not for production applications.
• Client mode: The driver process is on the machine that submits the application, hence outside the
cluster.
• Cluster mode: The entire application is submitted to a cluster manager. The cluster manager launches
the driver on a worker node, in addition to the executor processes. It is the most common mode.
The difference between client and cluster mode is where the driver runs: outside the cluster (client mode) or inside it (cluster mode).
Several deployment modes are distinguished depending on the environment where Spark runs:
• Standalone: Spark’s own Master process acts as the cluster manager, so no external resource manager is needed.
• On YARN: Spark can easily run on a Hadoop cluster, with YARN as the cluster manager. The driver process
can run inside an ApplicationMaster managed by YARN (cluster mode) or on the client machine, with a separate
ApplicationMaster that only requests resources from YARN (client mode). Executors run in containers managed by NodeManagers.
• On Mesos: The Mesos master becomes the cluster manager.
Mesos is a distributed systems kernel: the Mesos kernel runs on every machine and provides
applications (e.g. Hadoop, Spark, Kafka, Elasticsearch) with APIs for resource management and
scheduling across entire datacenter and cloud environments.
Life-cycle of a Spark application
The life-cycle of a Spark application in cluster mode consists of the following steps:
• Client request: An application is submitted (spark-submit) to the cluster manager, resources for the
Spark driver process are requested. The Spark driver process is placed onto a node and the client exits.
• Launch: The driver runs user code and initialises a Spark cluster by creating a SparkSession, which will
communicate with the cluster manager asking it to launch Spark executors. The cluster manager
responds by launching the executor processes.
• Execution: Code is run on the executor processes. The driver schedules tasks onto the executors, which
respond with the status of those tasks.
• Completion: The driver process exits with either success or failure and the cluster manager shuts down
the executors.
(Client mode is similar, the main difference being that the driver is placed outside the cluster)
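As a minimal sketch of the user code run by the driver at the Launch step (the application name here is a placeholder), a SparkSession can be created in PySpark as follows:
from pyspark.sql import SparkSession
# Creating the SparkSession is what triggers the request for executors
spark = SparkSession.builder \
    .appName("example-app") \
    .getOrCreate()
sc = spark.sparkContext   # lower-level entry point used in the RDD examples below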
Big Data Processing: Week 5
Structured APIs are high-level APIs for working with structured data:
• Datasets are collections of typed objects and provide compile-time type safety. Only in Java and Scala.
• DataFrames are treated as collections of untyped objects until run-time. Mainly for Python and R.
• SQL tables and views are DataFrames, the only difference being that SQL is executed against them.
https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
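As a small illustrative sketch (the names are made up), the three APIs meet in PySpark when a DataFrame is registered as a temporary view and queried with SQL, which returns another DataFrame:
df = spark.range(5).toDF("id")            # a DataFrame with one long column
df.createOrReplaceTempView("numbers")     # expose it as a SQL view
spark.sql("SELECT id FROM numbers WHERE id > 2").show()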
Types, schemas and partitioning
from pyspark.sql.types import StructType, StructField, StringType, LongType

myManualSchema = StructType([
    StructField("some", StringType(), True),    # True: the column may contain null values
    StructField("col", StringType(), True),
    StructField("names", LongType(), False)])   # False: the column must not contain nulls
• Spark’s core data sources: CSV, JSON, Parquet, ORC, JDBC/SQL databases and plain text files.
• Community data sources: MongoDB (https://www.mongodb.com), Cassandra (https://cassandra.apache.org/), HBase (https://hbase.apache.org), Amazon Redshift (https://aws.amazon.com/redshift/), …
Data can be read through the SparkSession via the read attribute:
spark.read.format(...).option(...).schema(...).load()
df.write.format(...).option(...).save()
https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html
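As a concrete sketch (the file paths are hypothetical), reading a CSV file with a header row and writing it back out as Parquet looks like:
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("/path/to/input.csv")
df.write.format("parquet").mode("overwrite").save("/path/to/output.parquet")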
Basic operations
Rows can be collected to the driver using the following actions:
• collect() gets the entire DataFrame, take() gets a number of rows, and show() prints out a number of rows
https://sparkbyexamples.com/pyspark-tutorial/
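For example, assuming df is the DataFrame loaded above:
all_rows = df.collect()   # list of Row objects; brings the entire DataFrame to the driver
some_rows = df.take(5)    # list containing the first 5 Row objects
df.show(5)                # prints the first 5 rows in tabular form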
Data analysis operations
https://sparkbyexamples.com/pyspark-tutorial/
User-Defined Functions (UDF)
Structured API execution: Logical plan
The logical plan represents a set of abstract transformations that do not refer to executors or drivers:
• Valid code first results in an unresolved logical plan: the tables or columns it refers to may not exist or may not have been checked yet.
• In the analyzer, table and column information from a repository (the catalog) is used to resolve columns and tables,
producing the resolved logical plan; the Catalyst optimiser then produces the optimised logical plan.
https://www.databricks.com/glossary/catalyst-optimizer
Structured API execution: Physical plan
The physical plan specifies how an optimised logical plan will be executed on a cluster as operations on
RDDs, by generating different planning strategies and comparing them through a cost model.
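The plans Spark generates can be inspected with explain(); a small sketch, assuming the df created with myManualSchema earlier:
df.filter(df["names"] > 0).explain(True)   # prints the parsed, analysed and optimised logical plans and the physical plan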
Big Data Processing: Week 5
An RDD
• represents an immutable, partitioned collection of records that can be operated on in parallel,
• and its records are just objects (unlike DataFrames, which have a schema).
Drawbacks of RDDs:
• Manipulations and interactions need to be defined by hand (risk of reinventing the wheel)
• Optimisations need to be implemented by hand
• With Python, UDFs need serialisation to and from the JVM, which affects performance, so Python is not recommended for RDD-heavy code
Partitions
• One partition per file block when creating from storage sources (blocks are 128MB by default in HDFS)
• The size can also be explicitly set when creating an RDD or through a transformation
Partitions
The repartition() and coalesce() transformations both control the number of partitions and are useful for running operations more efficiently, for instance for rebalancing data across the partitions after filtering down a large dataset.
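A small sketch of both (the numbers are arbitrary):
rdd = sc.parallelize(range(1000), 8)          # 8 partitions
sparse = rdd.filter(lambda x: x % 100 == 0)   # few records left, still spread over 8 partitions
balanced = sparse.repartition(2)              # full shuffle into 2 partitions
merged = rdd.coalesce(4)                      # merges down to 4 partitions, avoiding a full shuffle
print(balanced.getNumPartitions(), merged.getNumPartitions())   # 2 4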
Memory management
• Once the results are obtained, RDDs can be discarded, although they might be temporarily held in cache
• RDDs can be recreated from their chain of transformations (lineage), but doing so is very costly
Load error messages from a log into memory, then interactively search for patterns
lines = sc.textFile("/data...")
errors = lines.filter(lambda l: l.startswith("ERROR"))
messages = errors.map(lambda l: l.split("\t")[2])
cachedMsgs = messages.cache()

cachedMsgs.filter(lambda l: "foo" in l).count()
cachedMsgs.filter(lambda l: "bar" in l).count()
Example: Gradient Descent Algorithm
In the following example, the cache() method avoids loading the data in each iteration:
import math
import numpy as np

data = sc.textFile(...).map(readPoint).cache()   # cached, so the points are not recomputed/reloaded for each iteration
w = np.random.rand(D)

for i in range(1, ITERATIONS):
    gradient = data.map(lambda p:
        (1 / (1 + math.exp(-p[1] * np.dot(w, p[0]))) - 1) * p[1] * p[0]
    ).reduce(lambda a, b: a + b)
    w -= gradient

print("Final w: " + str(w))
Performance Issues
• All the added expressivity of Spark makes it much more challenging to allocate resources (especially memory) efficiently to the different RDDs
• Memory tuning
• Much more prone to OutOfMemory errors than MapReduce.
• How much memory is needed for each RDD partition?
• How many partitions make sense for each RDD?
• What are the performance implications of each operation?
https://sparkbyexamples.com/pyspark-tutorial/
Creating RDDs
Any existing local collection can be converted to an RDD using parallelize(); the number of partitions can optionally be set as the second argument:
sc.parallelize([1, 2, 3], 2) or
spark.sparkContext.parallelize([1, 2, 3], 2)
RDDs can also be created from external storage, e.g. from a file in HDFS:
sc.textFile("/hdfspath/to/file")
In this example, the created RDD is a collection of lines. Other SparkContext methods are available for external sources such as JSON, CSV, etc.
https://spark.apache.org/docs/1.6.1/sql-programming-guide.html#data-sources
Narrow and wide operations
• Narrow operations only involve records within a single partition, so no data moves between partitions.
Examples include map(), filter() and flatMap().
• Wide or shuffle operations involve records from multiple partitions and are costly; they are of the
same nature as the shuffle in MapReduce.
Examples include groupByKey(), join() and reduceByKey().
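A small sketch contrasting a narrow and a wide operation on the same pair RDD (data made up):
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
doubled = pairs.mapValues(lambda v: v * 2)       # narrow: each record stays in its partition
totals = pairs.reduceByKey(lambda x, y: x + y)   # wide: records are shuffled by key
print(totals.collect())                          # e.g. [('a', 4), ('b', 2)]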
Spark – MapReduce comparison
Spark is more expressive than MapReduce, with a richer set of operations:

Spark          MapReduce
map            Map
flatMap        Map
filter         Map
sample         Map
union          Map (2 inputs)
groupByKey     Shuffle
reduceByKey    Shuffle + Reduce
join           Shuffle + Reduce
…              …
Narrow Transformations
map(): creates a new RDD with the same number of records; each new record is the result
of applying the transformation function to the original record.
filter(): creates a new RDD with at most the same number of records as the original RDD.
A record is only kept if the function returns true for that record.
Both map() and filter() results have the same number of partitions as the source RDD.
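For instance (illustrative data):
nums = sc.parallelize([1, 2, 3, 4], 2)
squares = nums.map(lambda x: x * x)         # same number of records and partitions
evens = nums.filter(lambda x: x % 2 == 0)   # at most as many records, same partitions
print(squares.collect(), evens.collect())   # [1, 4, 9, 16] [2, 4]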
Narrow transformations
flatMap(): creates a new RDD in which each original record generates a variable (possibly zero)
number of output records when the transformation is applied.
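For instance, splitting lines into words (illustrative data):
lines = sc.parallelize(["to be", "or not to be"])
words = lines.flatMap(lambda l: l.split(" "))
print(words.collect())   # ['to', 'be', 'or', 'not', 'to', 'be']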
When the type of the elements of an RDD is numeric (e.g. Integer or Double), Spark also provides
aggregated summarisation methods on the RDD
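For example:
values = sc.parallelize([1.0, 2.0, 3.0, 4.0])
print(values.sum(), values.mean(), values.stdev())   # 10.0 2.5 1.118...
print(values.stats())                                # count, mean, stdev, max and min in a single pass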
• union(): returns the elements contained in either RDD, i.e. it combines the elements of both RDDs (duplicates are kept)
The following methods may require a shuffle and are hence costly to compute:
• cartesian(): returns all possible pairs of elements from both RDDs. It is the basis for performing joins,
one of the most expensive operations.
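For example (illustrative data):
a = sc.parallelize([1, 2])
b = sc.parallelize([2, 3])
print(a.union(b).collect())       # [1, 2, 2, 3] (duplicates are kept)
print(a.cartesian(b).collect())   # all 4 possible pairs, e.g. [(1, 2), (1, 3), (2, 2), (2, 3)]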
Reduce Operations
reduce() is an action which returns a single value from the RDD to the driver.
Group-by transformations are similar to the shuffling that takes place between the map and reduce phases in MapReduce:
• combineByKey(): similar to groupByKey() but combines the elements for each key using a
custom set of aggregation functions. It turns an RDD[(K, V)] into an RDD[(K, C)], for a
“combined type” C.
reduceByKey() and aggregateByKey() internally call combineByKey().
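A small sketch of reduce() and of a per-key average computed with combineByKey() (data made up):
print(sc.parallelize([1, 2, 3, 4]).reduce(lambda a, b: a + b))   # 10, returned to the driver

pairs = sc.parallelize([("a", 1), ("a", 3), ("b", 2)])
sum_count = pairs.combineByKey(lambda v: (v, 1),                          # createCombiner
                               lambda acc, v: (acc[0] + v, acc[1] + 1),   # mergeValue
                               lambda x, y: (x[0] + y[0], x[1] + y[1]))   # mergeCombiners
print(sum_count.mapValues(lambda s: s[0] / s[1]).collect())   # per-key averages, e.g. [('a', 2.0), ('b', 2.0)]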
Joins need a key
• Joins in Spark are implemented for pair (tuple) RDDs and are performed on the tuple keys
• Joins are computed much faster if both RDDs have the same number of partitions
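A small sketch (data made up); both RDDs are keyed by the same user ids:
users = sc.parallelize([(1, "alice"), (2, "bob")])
visits = sc.parallelize([(1, "home"), (1, "about"), (2, "home")])
print(users.join(visits).collect())
# e.g. [(1, ('alice', 'home')), (1, ('alice', 'about')), (2, ('bob', 'home'))]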
Retrieving information
• Actions allow the driver program to retrieve values from the RDDs. They are useful for algorithms and
interactive applications.
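A few common actions, as an illustration:
rdd = sc.parallelize(range(10))
print(rdd.count(), rdd.first(), rdd.take(3))   # 10 0 [0, 1, 2]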