Spark Tuning
Agenda
1. Tuning Spark parameters
   a. Controlling Spark's resource usage
   b. Advanced parameters
   c. Dynamic allocation
2. Tips for tuning your Spark program
3. Example use case of tuning a Spark algorithm
Tuning Spark Parameters
The easy way
Spark Architecture Simplified
Control Spark's resource usage
spark-submit command parameters (some only available when running on YARN)
Parameter | Description | Default value
Calculate the right values
For example: 4 servers for Spark, each with 64 GB RAM and 16 cores. How should we set the spark-submit parameters?
--num-executors 4 --executor-memory 63g --executor-cores 15
--num-executors 7 --executor-memory 29g --executor-cores 7
--num-executors 11 --executor-memory 19g --executor-cores 5
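The third option above can be derived mechanically. Here is a rough sketch of the arithmetic; the per-node reservations and the three-executors-per-node split are assumptions used for illustration, not Spark defaults:

```python
def size_executors(nodes, mem_gb, cores,
                   reserved_mem_gb=1, reserved_cores=1,
                   executors_per_node=3, overhead_fraction=0.10):
    """Rough executor sizing: leave some memory and cores per node for
    the OS and Hadoop daemons, split the rest among a fixed number of
    executors, deduct the ~10% YARN adds back as memoryOverhead, and
    keep one executor slot free for the YARN application master."""
    usable_mem = mem_gb - reserved_mem_gb          # per node
    usable_cores = cores - reserved_cores          # per node
    mem_per_executor = usable_mem / executors_per_node
    # the heap must leave room for spark.yarn.executor.memoryOverhead
    heap_gb = int(mem_per_executor / (1 + overhead_fraction))
    num_executors = nodes * executors_per_node - 1  # one slot for the AM
    exec_cores = usable_cores // executors_per_node
    return num_executors, heap_gb, exec_cores

print(size_executors(nodes=4, mem_gb=64, cores=16))  # (11, 19, 5)
```

This reproduces the third configuration above: 11 executors, 19 GB heap, 5 cores each.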
Spark Executors Memory Model
Memory requested from YARN for each container =
spark.executor.memory + spark.yarn.executor.memoryOverhead
spark.yarn.executor.memoryOverhead = max(spark.executor.memory * 0.1, 384 MB)
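The formula above can be checked with a few lines (MB units assumed throughout):

```python
def yarn_container_mb(executor_memory_mb, overhead_fraction=0.10,
                      min_overhead_mb=384):
    """Total memory YARN must grant per executor container:
    spark.executor.memory plus spark.yarn.executor.memoryOverhead,
    where the overhead is max(10% of the heap, 384 MB)."""
    overhead = max(int(executor_memory_mb * overhead_fraction),
                   min_overhead_mb)
    return executor_memory_mb + overhead

# A 19 GB executor actually asks YARN for roughly 20.9 GB:
print(yarn_container_mb(19 * 1024))   # 21401
# Small executors are floored at the 384 MB minimum overhead:
print(yarn_container_mb(2 * 1024))    # 2432
```

This is why the earlier sizing example requests 19 GB heaps rather than 21 GB: the extra ~10% overhead must still fit inside the node's memory.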
More advanced parameters
spark.shuffle.memoryFraction: Fraction of Java heap to use for aggregation and cogroups during shuffles (default: 0.2)
spark.shuffle.file.buffer: Size of the in-memory buffer for each shuffle file output stream (default: 32k)
spark.storage.memoryFraction: Fraction of Java heap to use for Spark's memory cache (default: 0.6)
Advanced Spark memory
Demo Spark UI
Using Dynamic Allocation
Dynamically scales the set of cluster resources allocated to your application up and down based on the workload
Only available when using YARN as the cluster manager
Requires an external shuffle service, so a shuffle service must be configured with YARN
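Putting those requirements together, a submission enabling dynamic allocation might look like the sketch below. The min/max values and the application name are placeholders; note that the external shuffle service must also be registered as an auxiliary service on the YARN NodeManagers, which is a YARN-side configuration step:

```shell
spark-submit \
  --master yarn \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.maxExecutors=20 \
  my_app.jar
```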
Dynamic Allocation parameters (1)
spark.shuffle.service.enabled: Enables the external shuffle service. This service preserves the shuffle files written by executors so the executors can be safely removed (default: false)
Dynamic Allocation parameters (2)
spark.dynamicAllocation.maxExecutors: Upper bound for the number of executors (default: infinity)
Dynamic Allocation in Action
Dynamic Allocation - The verdict
Dynamic allocation helps you use your cluster resources more efficiently
But it is only effective when the Spark application is a long-running one with long stages that need different numbers of tasks (Spark Streaming?)
In addition, when an executor is removed, all of its cached data is no longer accessible
Tips for Tuning Your Spark Program
Tuning Memory Usage
Prefer arrays of objects and primitive types over the standard Java or Scala collection classes (e.g. HashMap).
Avoid nested structures with many small objects and pointers when possible.
Use numeric IDs or enumeration objects instead of strings for keys.
If you have less than 32 GB of RAM, set the JVM flag -XX:+UseCompressedOops to make pointers four bytes instead of eight.
Other Tuning Tips (1)
Use KryoSerializer instead of the default JavaSerializer
Know when to persist an RDD and choose the right storage level
  MEMORY_ONLY
  MEMORY_AND_DISK
  MEMORY_ONLY_SER
Choose the right level of parallelism
  spark.default.parallelism
  repartition
  the second argument to methods in spark.PairRDDFunctions
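A sketch of how the serializer and parallelism settings might appear in spark-defaults.conf; the parallelism value here is purely illustrative (a common rule of thumb is 2-3 tasks per CPU core in the cluster):

```
spark.serializer           org.apache.spark.serializer.KryoSerializer
# assuming ~55 executor cores, 2x tasks per core:
spark.default.parallelism  110
```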
Other Tuning Tips (2)
Broadcast large variables
Do not collect() on large RDDs (filter first)
Be careful with operations that require a data shuffle (join, reduceByKey, groupByKey)
Avoid groupByKey; use reduceByKey, aggregateByKey, or combineByKey (low level) if possible.
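The groupByKey vs reduceByKey difference can be sketched without Spark: reduceByKey pre-combines values per key inside each partition before the shuffle, so far fewer records cross the network, while groupByKey ships every record. A toy simulation in plain Python, counting "shuffled" records:

```python
from collections import defaultdict

# two "map-side" partitions of (key, value) pairs
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("a", 1), ("b", 1)],
]

# groupByKey: every (key, value) pair crosses the network
shuffled_group = sum(len(p) for p in partitions)

def combine(partition):
    """Map-side combine, as reduceByKey does: fold values per key
    within the partition, so at most one record per key is shuffled."""
    acc = defaultdict(int)
    for k, v in partition:
        acc[k] += v
    return list(acc.items())

shuffled_reduce = sum(len(combine(p)) for p in partitions)

print(shuffled_group, shuffled_reduce)  # 7 4
```

Here groupByKey would shuffle all 7 records, while reduceByKey shuffles only 4; the gap grows with the number of duplicate keys per partition.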
groupByKey vs reduceByKey (1)
groupByKey vs reduceByKey (2)
Example use case of tuning a Spark algorithm
Tuning the CF algorithm in the RW project
1st algorithm, no parameter tuning: 27 min
1st algorithm, parameters tuned: 18 min
2nd algorithm (from Spark code), parameters tuned: ~7 min 30 s
3rd algorithm (improved Spark code), parameters tuned: ~6 min 30 s
Q&A
Thank You!