Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl Steinbach and Simon King

Carl Steinbach (LinkedIn)
Simon King (Pepperdata)

Hadoop @ LinkedIn c. 2015
• > 10 clusters
• > 10,000 nodes
• > 1,000 users
• Thousands of queries and flows in development
• Spark, Pig, Hive, Scalding, Gobblin, Cubert, ...

What we learned along the way
Scaling Hadoop Infrastructure is Hard
Scaling User Productivity is Harder

Some things we tried
• Training
– doesn’t scale
– interferes with productivity
• Expert Review
– doesn’t scale
– long wait times

What does Dr. Elephant do?
• Performance monitoring and tuning service
• Finds common mistakes
• Provides actionable advice
• Compares performance over time

Dr. Elephant User Interface

Outline
• Spark Event Logs and Spark History Server
• Dr. Elephant for Spark
• Pepperdata’s Application Profiler
simon@pepperdata.com

Spark Event Logs
{"Event":"SparkListenerTaskEnd","Stage ID":9,"Stage Attempt ID":0,"Task Type":"ResultTask","Task End Reason":{"Reason":"Success"},"Task Info":{"Task ID":775,"Index":55,"Attempt":0,"Launch Time":1495496382885,"Executor ID":"9","Host":"amarillo-rm.pepperdata.com","Locality":"PROCESS_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":1495496481595,"Failed":false,"Accumulables":[{"ID":7,"Name":"peakExecutionMemory","Update":"76154696","Value":"601560113","Internal":true}]},"Task Metrics":{"Host Name":"amarillo-rm.pepperdata.com","Executor Deserialize Time":11,"Executor Run Time":98690,"Result Size":1366,"JVM GC Time":51928,"Result Serialization Time":0,"Memory Bytes Spilled":0,"Disk Bytes Spilled":0,"Shuffle Read Metrics":{"Remote Blocks Fetched":114,"Local Blocks Fetched":6,"Fetch Wait Time":5,"Remote Bytes Read":743674,"Local Bytes Read":41686,"Total Records Read":120}}}
...n1.pepperdata.com","Locality":"PROCESS_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":1495496487808,"Failed":false,"Accumulables":[{"ID":7,"Name":"peakExecutionMemory","Update":"96536946","Value":"698097059","Internal":true}]},"Task Metrics":{"Host Name":"amarillo-n1.pepperdata.com","Executor Deserialize Time":4,"Executor Run Time":104915,"Result Size":1366,"JVM GC Time":68939,"Result Serialization...
...rm.pepperdata.com","Locality":"PROCESS_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":1495496507584,"Failed":false,"Accumulables":[{"ID":7,"Name":"peakExecutionMemory","Update":"105946616","Value":"804043675","Internal":true}]},"Task Metrics":{"Host Name":"amarillo-rm.pepperdata.com","Executor Deserialize Time":9,"Executor Run Time":124690,"Result Size":1366,"JVM GC Time":81294,"Result Serialization...
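Event logs hold one JSON object per line, so a tool can read them with an ordinary JSON parser. The sketch below parses a single SparkListenerTaskEnd line and computes the share of executor run time spent in JVM garbage collection. The sample line is a trimmed copy of the first event above, and the helper name gc_fraction is hypothetical (it is not part of Spark or Dr. Elephant).

```python
import json

# Trimmed copy of the first SparkListenerTaskEnd event shown above.
SAMPLE_LINE = (
    '{"Event":"SparkListenerTaskEnd","Stage ID":9,'
    '"Task Info":{"Task ID":775,"Host":"amarillo-rm.pepperdata.com"},'
    '"Task Metrics":{"Executor Run Time":98690,"JVM GC Time":51928}}'
)


def gc_fraction(line):
    """Return JVM GC time / executor run time for a task-end event, else None.

    Hypothetical helper for illustration only.
    """
    event = json.loads(line)
    if event.get("Event") != "SparkListenerTaskEnd":
        return None
    metrics = event.get("Task Metrics") or {}
    run_time = metrics.get("Executor Run Time", 0)
    if run_time <= 0:
        return None
    return metrics.get("JVM GC Time", 0) / run_time


ratio = gc_fraction(SAMPLE_LINE)
print(f"task 775 spent {ratio:.1%} of its run time in GC")
```

For the sample task, a little over half of the run time goes to garbage collection, which is the kind of signal a heuristic can turn into advice.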

Spark Application Heuristics

1: Configuration Heuristics
• Displays some basic config settings for your app
• Complains if some settings are not explicitly set
• Recommends configuring an external shuffle service (especially if dynamic allocation is enabled)
• These recommendations won’t change over multiple runs of an application
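A configuration check of this kind can be sketched in a few lines. The list of expected settings and the function name check_config below are assumptions for illustration, not Dr. Elephant's actual rules; the property names themselves are standard Spark properties.

```python
# Settings the check expects to be set explicitly. This list is an
# assumption for illustration, not Dr. Elephant's actual list.
EXPECTED_KEYS = [
    "spark.driver.memory",
    "spark.executor.memory",
    "spark.executor.cores",
]


def is_true(conf, key):
    """Treat a missing property as false, as Spark does for these two flags."""
    return conf.get(key, "false").lower() == "true"


def check_config(conf):
    """Return a list of advice strings for a dict of Spark properties."""
    advice = [f"{key} is not explicitly set" for key in EXPECTED_KEYS if key not in conf]
    if not is_true(conf, "spark.shuffle.service.enabled"):
        note = "consider enabling the external shuffle service"
        if is_true(conf, "spark.dynamicAllocation.enabled"):
            note += " (dynamic allocation is enabled)"
        advice.append(note)
    return advice


# Hypothetical application configuration.
example = {
    "spark.executor.memory": "4g",
    "spark.dynamicAllocation.enabled": "true",
}
for line in check_config(example):
    print(line)
```

Because the input is only the application's configuration, the output stays the same from run to run until the configuration changes.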

2: Stages and Jobs Heuristics
• Simple alarms showing stage and job failure rates
• Good for seeing when there’s a problem
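A failure-rate alarm of this sort might look like the sketch below. The thresholds, severity labels, and counts are hypothetical values chosen for illustration, not Dr. Elephant's defaults.

```python
def failure_rate(failed, total):
    """Fraction of failed items; 0.0 when nothing ran."""
    return failed / total if total else 0.0


def severity(rate, moderate=0.1, severe=0.3):
    """Map a failure rate to an alarm level (thresholds are illustrative)."""
    if rate >= severe:
        return "SEVERE"
    if rate >= moderate:
        return "MODERATE"
    return "NONE"


# Hypothetical (failed, total) counts for one application run.
counts = {"jobs": (1, 20), "stages": (12, 30)}
for kind, (failed, total) in counts.items():
    rate = failure_rate(failed, total)
    print(f"{kind}: {rate:.0%} failed -> {severity(rate)}")
```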

3: Executor Heuristics
• Looks at the distribution across executors of several different metrics
• Outliers in these distributions probably indicate:
– Suboptimal partitioning
– One or more slow executors due to external circumstances (cluster weather)
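One simple way to spot such outliers is to compare each executor against the median. The sketch below assumes a max-to-median ratio and a 2x cutoff; both choices, the function names, and the sample numbers are illustrative assumptions rather than the heuristic's real definition.

```python
from statistics import median


def skew_ratio(values):
    """Ratio of the largest value to the median; 1.0 means perfectly even."""
    mid = median(values)
    return max(values) / mid if mid > 0 else float("inf")


def outliers(per_executor, threshold=2.0):
    """Return executor IDs whose value exceeds threshold x the median."""
    mid = median(per_executor.values())
    return sorted(e for e, v in per_executor.items() if v > threshold * mid)


# Hypothetical total task run time (ms) per executor ID.
run_time_ms = {"1": 98_000, "2": 104_000, "3": 101_000, "4": 310_000}
print(f"max/median = {skew_ratio(list(run_time_ms.values())):.2f}")
print("outlier executors:", outliers(run_time_ms))
```

The same comparison can be applied to other per-executor metrics such as shuffle bytes or peak memory. It flags an outlier but cannot say whether partitioning or cluster weather caused it.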

4: Partitions Heuristic
• Ideally, the data for each task will fit into the RAM available to that task.
• Cloudera has an excellent blog post on Spark tuning, which estimates the number of partitions as:
(observed shuffle write) * (observed shuffle spill memory) * (spark.executor.cores)
------------------------------------------------------------------------------------------------------------------------
(observed shuffle spill disk) * (spark.executor.memory) * (spark.shuffle.memoryFraction) * (spark.shuffle.safetyFraction)
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
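The ratio from the blog post can be evaluated directly. The sketch below is a minimal version that assumes all sizes are in bytes and that the two fractions default to 0.2 and 0.8 (believed to be the legacy Spark 1.x defaults). The function name and the sample numbers are hypothetical.

```python
import math

GIB = 2 ** 30


def estimate_partitions(shuffle_write, spill_memory, spill_disk,
                        executor_memory, executor_cores,
                        memory_fraction=0.2, safety_fraction=0.8):
    """Estimate a partition count from observed shuffle metrics.

    Sizes are in bytes. The fraction defaults are assumed legacy values of
    spark.shuffle.memoryFraction and spark.shuffle.safetyFraction.
    """
    if spill_disk <= 0:
        # The ratio is only defined for a stage that actually spilled.
        raise ValueError("no disk spill observed; the estimate does not apply")
    numerator = shuffle_write * spill_memory * executor_cores
    denominator = spill_disk * executor_memory * memory_fraction * safety_fraction
    return math.ceil(numerator / denominator)


# Hypothetical stage: 10 GiB shuffle write, 30 GiB spill (memory),
# 10 GiB spill (disk), executors with 4 GiB of memory and 4 cores.
partitions = estimate_partitions(
    shuffle_write=10 * GIB,
    spill_memory=30 * GIB,
    spill_disk=10 * GIB,
    executor_memory=4 * GIB,
    executor_cores=4,
)
print(partitions)
```

Rounding up follows the blog post's advice that too many partitions is usually better than too few.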

More Heuristics?
Yes, please! Dr. Elephant is open source.

Pepperdata
• Mostly for Operators
– Capacity Optimizer
– Policy Enforcer
– Cluster Analyzer
• For Developers
– Application Profiler

Application Profiler
• Benefits to our users:
– Provide simple answers to simple questions
– Combination of metrics for experts, simple actionable insights for users
– Pepperdata support
• Why stay close to open source?
– Heuristics

Application Profiler, Hardware and Cluster Weather

Thanks!
Stop by the Pepperdata booth (#101)
Come to the Dr. Elephant Meetup:
6:00 PM Wednesday, June 7, 2017
LinkedIn San Francisco Office
222 2nd Street, San Francisco
Get involved with Dr. Elephant:
https://github.com/linkedin/dr-elephant
Contact us:
simon@pepperdata.com, csteinbach@linkedin.com
