Fine Tuning and Enhancing Performance of Apache Spark Jobs

Fine Tuning and Enhancing
Performance of Apache Spark Jobs
Blake Becerra, Kira Lindke, Kaushik Tadikonda

Our Setup
▪ Data Validation Tool for ETL
▪ Millions of comparisons and aggregations
▪ One of the larger datasets initially took 4+ hours, unstable
▪ Challenge: improve reliability and performance
▪ Months of research and tuning, same application takes 35 minutes

Configuring Cluster
▪ Test changes with
▪ spark.driver.cores
▪ spark.driver.memory
▪ executor-memory
▪ executor-cores
▪ spark.cores.max
▪ reserve cores (~1 core per node) for
daemons
▪ num_executors = total_cores /
num_cores
▪ num_partitions
▪ Too much memory per executor can
result in excessive GC delays
▪ Too little memory can lose benefits
from running multiple tasks in a single
JVM
▪ Look at stats (network, CPU, memory,
etc. and tweak to improve
performance)

Skew
▪ Can severely downgrade
performance
▪ Extreme imbalance of work in the cluster
▪ Tasks within a stage (like a join) take uneven
amounts of time to finish
▪ How to check
▪ Spark UI shows job waiting only on some of the
tasks
▪ Look for large variances in memory usage within a
job (primarily works if it is at the beginning of the job
and doing data ingestion – otherwise can be
misleading)
▪ Executor missing heartbeat
▪ Check partition sizes of RDD while debugging to
confirm

Example – skew in ingestion of data
Uneven partitions Even partitions

Handling skew cont’d
▪ Blind repartition is the most
naïve approach but effective
▪ Great for "narrow transformations"
▪ Good for increasing the number of partitions
▪ Use coalesce not repartition to decrease
partitions
▪ Choosing a better approach for
joins specifically coming later

Handling skew - ingestion
▪ Use spark options to do partitioned reads with JDBC
▪ partitionColumn
▪ lower/upperBound - used to determine stride
▪ numPartitions – maximum number of partitions
▪ partitionColumn
▪ Ideally is not skewed (such as primary key)
▪ Stride can have skew
▪ Possible trick – mod function
▪ Example of working with slow jdbc databases:
▪ Initial query took ~40 minutes
▪ Took it down to 10 minutes

Example
lowerBound: 0
upperBound: 1000
numPartitions: 10
=> Stride is equal to 100 and partitions correspond to following queries:
SELECT * FROM tableName WHERE partitionModColumn < 100
SELECT * FROM tableName WHERE partitionModColumn >= 100 AND
partitionModColumn < 200
...
SELECT * FROM tableName WHERE partitionModColumn >= 900

Also works with: partitioned data ingestion
Example – reading from object storage, files, etc.
▪ Reads in with same skew
▪ Read in then repartition to get even distribution

Cache/Persist
▪ Reuse a DataFrame with
transformations
▪ Unpersist when done
▪ Without cache, DataFrame is built
from scratch each time
▪ Don't over persist
▪ Worse memory performance
▪ Possible slowdown
▪ GC pressure

Other Performance
Improvements
▪ Try seq.par.foreach instead of
just seq.foreach
▪ Increases parallelization
▪ Race conditions and non-deterministic results
▪ Use accumulators or synchronization to protect
▪ Avoid UDFs if possible
▪ Deserialize every row to object
▪ Apply lambda
▪ Then reserialize it
▪ More garbage generated

Join Optimization
▪ Different join strategies
▪ Data preprocessing - clever
filtering
▪ Avoiding unnecessary scans
▪ Locality - use of same partitioner
▪ Salting - skewed join keys

Filter Trick
Included in Spark 3.0

Things to remember
▪ Follow good partitioning strategies
▪ Too few partitions – less parallelism
▪ Too many partitions – scheduling issues
▪ Improve scan efficiency
▪ Try to use same partitioner
between DataFrames for join
▪ Skipped stages are good
▪ Caching prevents repeated
exchanges where data is re-used

Fair Scheduling
▪ Round Robin fashion
▪ Multiple jobs can run
simultaneously if submitted by
threads inside an application
▪ Better resource utilization
▪ Harder to debug and makes
performance tuning more
difficult

Fair Scheduling and
Pools
▪ Enabled by setting
spark.scheduler.mode to FAIR
▪ Allows jobs to be grouped into
pools with prioritization
options
▪ Jobs created from threads
without a pool defined always
go to the “default” pool
▪ Pools, weight and minShare
are specified in
fairscheduler.xml

Serialization
▪ Default for most types
▪ Can work with any class
▪ More flexible
▪ Slower
▪ Default for shuffling RDDs with simple
types
▪ Significantly faster and more compact
▪ Set your SparkConf to
use "org.apache.spark.serializer.KryoSe
rializer"
▪ Register classes in order to take full
advantage
Kryo SerializerJava Serializer
https://spark.apache.org/docs/latest/tuning.html#data-serialization

CompressedOOPs
▪ JVM flag –XX:UseCompressedOops
allows usage of 4-byte pointers
instead of 8-byte
▪ Does not work as heap size grows
beyond 32 GB
▪ Need 48 GB heap without compressed
pointers to hold same number of
objects as 32 GB

Garbage Collection
Tuning
▪ Contention when allocating
more objects
▪ Full GC occurring before tasks
finish
▪ Too many RDDs cached
▪ Frequent garbage collection
▪ Long time spent in GC
▪ Missed heartbeats

Enable GC Logging
-XX:+UseParallelGC
-verbose:gc
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+UseG1GC
-verbose:gc
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+UnlockDiagnosticVMOptions
-XX:+PrintAdaptiveSizePolicy
-XX:+G1SummarizeConcMark
G1GC:ParallelGC:

ParallelGC (default)
Heap space is divided into Young and Old
▪ Minor GC occurs frequently:
▪ Increase Eden and Survivor space
▪ Major GC occurs frequently:
▪ Increase young space if too many objects are being promoted
▪ Increase old space
▪ Decrease spark.memory.fraction
▪ Full GC occurs before task finishes:
▪ Try clearing young space by frequent GC triggers
▪ Increase memory

G1GC
▪ Breaks heap into thousands of regions
▪ Recommended if heap sizes are
large (> 8GB) and GC times are long
▪ Lower latency traded with higher
CPU usage for GC bookkeeping
For large scale applications, these properties help preventing full
collections:
▪ -XX:ParallelGCThreads = n
▪ -XX:ConcGCThreads = [n, 2n]
▪ -XX:InitiatingHeapOccupancyPercent = 35

-XX:+UseG1GC
-XX:ParallelGCThreads=8
-XX:ConcGCThreads=16
-XX:InitiatingHeapOccupancyPercent=35
-XX:+UseG1GC

Takeaways
▪ Performance tuning is iterative
▪ Tuning is case by case
▪ Take advantage of the Spark UI, logs, and available monitoring
▪ Focus on the major slowdowns, not on one particular trick
▪ You can't be perfect

Thank you for your time!
Questions

Fine Tuning and Enhancing Performance of Apache Spark Jobs

Recommended

More Related Content

What's hot (20)

Similar to Fine Tuning and Enhancing Performance of Apache Spark Jobs (20)

More from Databricks (20)

Recently uploaded (20)

Fine Tuning and Enhancing Performance of Apache Spark Jobs