Spark Tuning

Spark parameters like num-executors, executor-cores, and executor-memory can be tuned to control Spark's resource usage. Dynamic allocation allows Spark to dynamically scale resources up and down based on workload. Tips for tuning a Spark program include preferring primitive types over collections, avoiding nested structures, using IDs over strings, and choosing the right level of parallelism. An example use case showed tuning a collaborative filtering algorithm's parameters reduced its runtime from 27 minutes to around 6.5 minutes.


Tuning

Q4s Research Report


[email protected]

Agenda
1. Tuning Spark parameters
   a. Controlling Spark's resource usage
   b. Advanced parameters
   c. Dynamic Allocation
2. Tips for tuning your Spark program
3. Example use case of tuning a Spark algorithm

Tuning Spark Parameters

The easy way

If your Spark application is slow, just give it
more system resources.
Is there anything simpler?

Spark Architecture Simplified

Control Spark's resource usage
spark-submit command-line parameters (some only
available when running on YARN):

Parameter         Description                                      Default
num-executors     Number of executors to launch                    2
executor-cores    Number of cores per executor                     1
executor-memory   Memory per executor                              1g
driver-cores      Number of cores for the driver (YARN cluster     1
                  mode only)
driver-memory     Memory for the driver                            1g

Calculate the right values
For example: 4 servers for Spark, each with 64 GB RAM
and 16 cores. How should we set the spark-submit
parameters?

--num-executors 4  --executor-memory 63g --executor-cores 15
--num-executors 7  --executor-memory 29g --executor-cores 7
--num-executors 11 --executor-memory 19g --executor-cores 5

The last option is generally the soundest: roughly 5 cores
per executor keeps HDFS I/O efficient, and it leaves a core
and some memory on each node for the OS, the Hadoop daemons,
and the YARN memory overhead described on the next slide.
The first option leaves room for none of these.

Spark Executor Memory Model
Memory requested from YARN for each container =
  spark.executor.memory + spark.yarn.executor.memoryOverhead
spark.yarn.executor.memoryOverhead =
  max(spark.executor.memory * 0.1, 384m)
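
A quick worked check of the formula: with --executor-memory 19g,
the overhead is max(19g * 0.1, 384m) = 1.9g, so each container
requested from YARN is roughly 20.9g, not 19g. This is why
executor memory cannot be set to the whole of a node's RAM.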

More advanced parameters

Parameter                        Description                                        Default
spark.shuffle.memoryFraction     Fraction of Java heap to use for aggregation       0.2
                                 and cogroups during shuffles
spark.reducer.maxSizeInFlight    Maximum size of map outputs to fetch               48m
                                 simultaneously from each reduce task
spark.shuffle.consolidateFiles   If set to "true", consolidates intermediate        false
                                 files created during a shuffle
spark.shuffle.file.buffer        Size of the in-memory buffer for each shuffle      32k
                                 file output stream
spark.storage.memoryFraction     Fraction of Java heap to use for Spark's           0.6
                                 memory cache
spark.akka.frameSize             Maximum message size to allow in "control          10
                                 plane" communication (for serialized tasks
                                 and task results), in MB
spark.akka.threads               Number of actor threads to use for                 4
                                 communication
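
As a sketch, these can be set on a SparkConf; the values below
are illustrative assumptions, not recommendations (several of
these keys are legacy Spark 1.x settings):

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative values only -- tune against your own workload.
val conf = new SparkConf()
  .setAppName("ShuffleTuningDemo")
  .set("spark.shuffle.file.buffer", "64k")        // bigger buffer, fewer disk flushes
  .set("spark.reducer.maxSizeInFlight", "96m")    // fetch more map output per request
  .set("spark.shuffle.consolidateFiles", "true")  // fewer intermediate shuffle files
val sc = new SparkContext(conf)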

Advanced Spark memory

Demo Spark UI

Using Dynamic Allocation
Dynamically scales the set of cluster resources
allocated to your application up and down based on
the workload.
Only available when using YARN as the cluster
manager.
Requires an external shuffle service, so you must
configure a shuffle service in YARN.

Dynamic Allocation parameters (1)

Parameter                                          Description                            Default
spark.shuffle.service.enabled                      Enables the external shuffle           false
                                                   service. This service preserves
                                                   the shuffle files written by
                                                   executors so the executors can
                                                   be safely removed.
spark.dynamicAllocation.enabled                    Whether to use dynamic resource        false
                                                   allocation.
spark.dynamicAllocation.executorIdleTimeout        If an executor has been idle for       60s
                                                   more than this duration, it will
                                                   be removed.
spark.dynamicAllocation.cachedExecutorIdleTimeout  If an executor holding cached data     infinity
                                                   blocks has been idle for more than
                                                   this duration, it will be removed.
spark.dynamicAllocation.initialExecutors           Initial number of executors to run.    minExecutors

Dynamic Allocation parameters (2)

Parameter                                                 Description                        Default
spark.dynamicAllocation.maxExecutors                      Upper bound for the number of      infinity
                                                          executors.
spark.dynamicAllocation.minExecutors                      Lower bound for the number of      0
                                                          executors.
spark.dynamicAllocation.schedulerBacklogTimeout           If there have been pending tasks   1s
                                                          backlogged for more than this
                                                          duration, new executors will be
                                                          requested.
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout  Same as schedulerBacklogTimeout,   schedulerBacklogTimeout
                                                          but used only for subsequent
                                                          executor requests.
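
Putting the two tables together, a minimal sketch of enabling
dynamic allocation from code; the min/max values here are
illustrative assumptions:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("DynamicAllocationDemo")
  // The external shuffle service must also be set up on the YARN NodeManagers.
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "2")    // illustrative
  .set("spark.dynamicAllocation.maxExecutors", "20")   // illustrative
  .set("spark.dynamicAllocation.executorIdleTimeout", "60s")
val sc = new SparkContext(conf)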

Dynamic Allocation in Action

Dynamic Allocation - The verdict
Dynamic Allocation helps you use your cluster
resources more efficiently.
But it is only effective when the Spark application
is long-running, with long stages that need different
numbers of tasks (Spark Streaming?).
In addition, when an executor is removed, all of its
cached data is no longer accessible.

Tips for Tuning Your Spark
Program

Tuning Memory Usage
Prefer arrays of objects and primitive types to the
standard Java or Scala collection classes (e.g.
HashMap).
Avoid nested structures with many small objects and
pointers when possible.
Use numeric IDs or enumeration objects instead of
strings for keys.
If you have less than 32 GB of RAM, set the JVM flag
-XX:+UseCompressedOops to make pointers four bytes
instead of eight.
A sketch of the first three tips follows below.
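
A minimal sketch in spark-shell style; the records and field
names are hypothetical:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(
  new SparkConf().setAppName("MemoryTipsDemo").setMaster("local[*]"))

// Heavier: String keys, plus a nested Map of boxed values per record.
val byName = sc.parallelize(Seq(
  "alice" -> Map("age" -> 30, "score" -> 95),
  "bob"   -> Map("age" -> 25, "score" -> 88)))

// Lighter: numeric IDs as keys and a primitive array per record.
val byId = sc.parallelize(Seq(
  (1L, Array(30.0, 95.0)),
  (2L, Array(25.0, 88.0))))

println(s"records: ${byName.count()} vs ${byId.count()}")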

Other Tuning Tips (1)
Use KryoSerializer instead of the default JavaSerializer.
Know when to persist an RDD, and choose the right
storage level:
  MEMORY_ONLY
  MEMORY_AND_DISK
  MEMORY_ONLY_SER
Choose the right level of parallelism:
  spark.default.parallelism
  repartition
  the 2nd argument to methods in PairRDDFunctions
See the sketch below.
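
A minimal sketch combining these; the input path and the
value 200 are illustrative assumptions:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .setAppName("TuningTipsDemo")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.default.parallelism", "200")        // illustrative
val sc = new SparkContext(conf)

val pairs = sc.textFile("hdfs:///data/events")    // hypothetical path
  .map(line => (line.split(",")(0), 1))

// Serialized caching: slower to access, far more compact in memory.
pairs.persist(StorageLevel.MEMORY_ONLY_SER)

// Parallelism via the 2nd argument of a PairRDDFunctions method...
val counts = pairs.reduceByKey(_ + _, 200)
// ...or by repartitioning explicitly.
val coarser = counts.repartition(100)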
Other tuning tips (2)
Broadcast large variables.
Do not collect() on large RDDs (filter first).
Be careful with operations that require a data
shuffle (join, reduceByKey, groupByKey).
Avoid groupByKey; use reduceByKey, aggregateByKey,
or combineByKey (lower level) if possible, as
illustrated in the sketches below.
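
A minimal sketch of the first two tips, reusing sc and the
byId pairs from the earlier sketch:

// Broadcast a large lookup table once per executor instead of
// shipping a copy inside every task closure.
val lookup: Map[Long, String] = Map(1L -> "a", 2L -> "b")  // large in practice
val bLookup = sc.broadcast(lookup)

val labeled = byId.map { case (id, v) => (bLookup.value.getOrElse(id, "?"), v) }

// Filter on the cluster first, then collect the small remainder.
val interesting = labeled.filter { case (label, _) => label != "?" }.collect()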

groupByKey vs reduceByKey (1)

groupByKey vs reduceByKey (2)
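
The two slides above showed the shuffle diagrams; as code, the
comparison looks like this (word-count sketch, reusing sc):

val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a")).map(w => (w, 1))

// groupByKey ships every single (word, 1) pair across the network,
// then sums on the reduce side.
val viaGroup = words.groupByKey().mapValues(_.sum)

// reduceByKey combines map-side first, so far less data is shuffled.
val viaReduce = words.reduceByKey(_ + _)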

Example use case of tuning a Spark
algorithm

Tuning the CF algorithm in the RW project
1st algorithm, no parameter tuning: 27 mins
1st algorithm, parameters tuned: 18 mins
2nd algorithm (from Spark code), parameters tuned: ~7 mins 30 s
3rd algorithm (improved Spark code), parameters tuned: ~6 mins 30 s

Q&A

Thank You!

