Carl Steinbach (LinkedIn)
Simon King (Pepperdata)
Dr. Elephant for Monitoring & Tuning Apache Spark Jobs on Hadoop
Hadoop @ LinkedIn c. 2015
• > 10 clusters
• > 10,000 nodes
• > 1,000 users
• Thousands of queries and flows in development
• Spark, Pig, Hive, Scalding, Gobblin, Cubert, ...
2
What we learned along the way
Scaling Hadoop Infrastructure is Hard
Scaling User Productivity is Harder
3
4
Tuning Hadoop and Spark
Some things we tried
• Training
– doesn’t scale
– interferes with productivity
• Expert Review
– doesn’t scale
– long wait times
5
Birth of Dr. Elephant!
6
What does Dr. Elephant do?
• Performance monitoring and tuning service
• Finds common mistakes
• Provides actionable advice
• Compares performance changes over time
7
Architecture
8
Dr. Elephant User Interface
9
Dr. Elephant User Interface
10
Dr. Elephant Community
11
Outline
• Spark Event Logs and Spark History Server
• Dr. Elephant for Spark
• Pepperdata’s Application Profiler
simon@pepperdata.com
12
Spark History Server
13
Spark History Server
14
Spark Event Logs
15
Spark Event Logs
{"Event":"SparkListenerTaskEnd","Stage ID":9,"Stage Attempt ID":0,"Task Type":"ResultTask","Task End Reason":{"Reason":"Success"},"Task Info":{"Task ID":775,"Index":55,"Attempt":0,"Launch Time":1495496382885,"Executor ID":"9","Host":"amarillo-rm.pepperdata.com","Locality":"PROCESS_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":1495496481595,"Failed":false,"Accumulables":[{"ID":7,"Name":"peakExecutionMemory","Update":"76154696","Value":"601560113","Internal":true}]},"Task Metrics":{"Host Name":"amarillo-rm.pepperdata.com","Executor Deserialize Time":11,"Executor Run Time":98690,"Result Size":1366,"JVM GC Time":51928,"Result Serialization Time":0,"Memory Bytes Spilled":0,"Disk Bytes Spilled":0,"Shuffle Read Metrics":{"Remote Blocks Fetched":114,"Local Blocks Fetched":6,"Fetch Wait Time":5,"Remote Bytes Read":743674,"Local Bytes Read":41686,"Total Records Read":120}}}
{"Event":"SparkListenerTaskEnd","Stage ID":9,"Stage Attempt ID":0,"Task Type":"ResultTask","Task End Reason":{"Reason":"Success"},"Task Info":{"Task ID":770,"Index":50,"Attempt":0,"Launch Time":1495496382879,"Executor ID":"8","Host":"amarillo-n1.pepperdata.com","Locality":"PROCESS_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":1495496487808,"Failed":false,"Accumulables":[{"ID":7,"Name":"peakExecutionMemory","Update":"96536946","Value":"698097059","Internal":true}]},"Task Metrics":{"Host Name":"amarillo-n1.pepperdata.com","Executor Deserialize Time":4,"Executor Run Time":104915,"Result Size":1366,"JVM GC Time":68939,"Result Serialization Time":0,"Memory Bytes Spilled":0,"Disk Bytes Spilled":0,"Shuffle Read Metrics":{"Remote Blocks Fetched":111,"Local Blocks Fetched":9,"Fetch Wait Time":10,"Remote Bytes Read":921999,"Local Bytes Read":74622,"Total Records Read":120}}}
{"Event":"SparkListenerTaskEnd","Stage ID":9,"Stage Attempt ID":0,"Task Type":"ResultTask","Task End Reason":{"Reason":"Success"},"Task Info":{"Task ID":769,"Index":49,"Attempt":0,"Launch Time":1495496382874,"Executor ID":"5","Host":"amarillo-rm.pepperdata.com","Locality":"PROCESS_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":1495496507584,"Failed":false,"Accumulables":[{"ID":7,"Name":"peakExecutionMemory","Update":"105946616","Value":"804043675","Internal":true}]},"Task Metrics":{"Host Name":"amarillo-rm.pepperdata.com","Executor Deserialize Time":9,"Executor Run Time":124690,"Result Size":1366,"JVM GC Time":81294,"Result Serialization Time":0,"Memory Bytes Spilled":0,"Disk Bytes Spilled":0,"Shuffle Read Metrics":{"Remote Blocks Fetched":108,"Local Blocks Fetched":12,"Fetch Wait Time":2,"Remote Bytes Read":972911,"Local Bytes Read":113196,"Total Records Read":120}}}
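Each event in the log is a self-contained JSON object, so per-task metrics can be extracted with very little code. The following is a minimal sketch, not Dr. Elephant's implementation: it assumes one event per line, uses a trimmed copy of the three events above (only the fields it reads are kept), and the helper name gc_fractions is hypothetical.

```python
import json

# Trimmed copies of the three SparkListenerTaskEnd events shown above.
# Only the fields this sketch reads are kept.
SAMPLE_LOG = """
{"Event":"SparkListenerTaskEnd","Stage ID":9,"Task Info":{"Task ID":775},"Task Metrics":{"Executor Run Time":98690,"JVM GC Time":51928}}
{"Event":"SparkListenerTaskEnd","Stage ID":9,"Task Info":{"Task ID":770},"Task Metrics":{"Executor Run Time":104915,"JVM GC Time":68939}}
{"Event":"SparkListenerTaskEnd","Stage ID":9,"Task Info":{"Task ID":769},"Task Metrics":{"Executor Run Time":124690,"JVM GC Time":81294}}
"""


def gc_fractions(log_text):
    """Return {task_id: fraction of executor run time spent in JVM GC}."""
    fractions = {}
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        event = json.loads(line)
        if event.get("Event") != "SparkListenerTaskEnd":
            continue  # other listener events carry no task metrics
        metrics = event.get("Task Metrics") or {}
        run_time = metrics.get("Executor Run Time", 0)
        if run_time > 0:
            task_id = event["Task Info"]["Task ID"]
            fractions[task_id] = metrics.get("JVM GC Time", 0) / run_time
    return fractions


if __name__ == "__main__":
    for task_id, fraction in sorted(gc_fractions(SAMPLE_LOG).items()):
        print(f"task {task_id}: {fraction:.0%} of run time in GC")
```

For these three tasks the GC share of run time is above one half in every case, which is the kind of signal a tuning heuristic can flag.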
16
Dr. Elephant
17
Spark Application Heuristics
18
Spark Application Heuristics
19
Spark Application Heuristics
20
1: Configuration Heuristics
• Display some basic config settings for your app
• Complain if some settings are not explicitly set
• Recommend configuring an external shuffle
service (especially if dynamic allocation is
enabled)
• These recommendations won’t change over
multiple runs of an application
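The checks on this slide could look roughly like the sketch below. It is an illustration under assumptions, not Dr. Elephant's code: the list of settings expected to be set explicitly and the function name check_configuration are hypothetical choices, while the property names themselves are standard Spark configuration keys.

```python
# Settings this sketch expects to see set explicitly (an illustrative choice).
EXPECTED_SETTINGS = [
    "spark.executor.memory",
    "spark.executor.cores",
    "spark.driver.memory",
]


def is_enabled(conf, key):
    """Treat a missing boolean property as false."""
    return conf.get(key, "false").strip().lower() == "true"


def check_configuration(conf):
    """Return a list of warnings for a dict of Spark properties."""
    warnings = []
    for key in EXPECTED_SETTINGS:
        if key not in conf:
            warnings.append(f"{key} is not explicitly set")
    if not is_enabled(conf, "spark.shuffle.service.enabled"):
        if is_enabled(conf, "spark.dynamicAllocation.enabled"):
            warnings.append(
                "dynamic allocation is enabled without an external shuffle service"
            )
        else:
            warnings.append("consider configuring an external shuffle service")
    return warnings


if __name__ == "__main__":
    app_conf = {
        "spark.executor.memory": "4g",
        "spark.dynamicAllocation.enabled": "true",
    }
    for warning in check_configuration(app_conf):
        print("WARNING:", warning)
```

Because the input is only the application's configuration, the output stays the same across runs until the configuration changes.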
21
2: Stages and Jobs Heuristics
• Simple alarms showing stage and job failure rates
• Good for seeing when there’s a problem
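A failure-rate alarm of this kind can be sketched in a few lines. The thresholds (10% and 30%) and the severity labels below are assumptions made for illustration; they are not taken from the talk or from Dr. Elephant.

```python
def failure_rate(failed, total):
    """Fraction of stages (or jobs) that failed; 0.0 when nothing ran."""
    return failed / total if total else 0.0


def severity(rate, moderate=0.1, severe=0.3):
    """Map a failure rate to an alarm level using hypothetical thresholds."""
    if rate >= severe:
        return "severe"
    if rate >= moderate:
        return "moderate"
    return "none"


if __name__ == "__main__":
    # Hypothetical counts for one application run.
    print("stages:", severity(failure_rate(failed=2, total=10)))
    print("jobs:", severity(failure_rate(failed=0, total=4)))
```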
22
3: Executor Heuristics
• Looks at the distribution across executors of
several different metrics
• Outliers in these distributions probably indicate:
– Suboptimal partitioning.
– One or more slow executors due to external
circumstances (cluster weather)
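One simple way to look for outliers in a per-executor distribution is to compare each executor with the median. The sketch below assumes that approach; the ratio threshold of 2.0, the function name find_outliers, and the sample numbers are all hypothetical, and Dr. Elephant's own heuristic may use different statistics.

```python
from statistics import median


def find_outliers(metric_by_executor, ratio_threshold=2.0):
    """Return {executor_id: ratio to median} for executors above the threshold."""
    values = list(metric_by_executor.values())
    if not values:
        return {}
    mid = median(values)
    if mid <= 0:
        return {}  # ratios are meaningless without a positive median
    return {
        executor: value / mid
        for executor, value in metric_by_executor.items()
        if value / mid > ratio_threshold
    }


if __name__ == "__main__":
    # Hypothetical total task time per executor, in seconds.
    task_time = {"1": 100, "2": 110, "3": 95, "4": 400}
    for executor, ratio in find_outliers(task_time).items():
        print(f"executor {executor} is {ratio:.1f}x the median")
```

An executor flagged this way does not say which cause applies; skewed partitions and a slow host both produce the same pattern, so the same check is worth running on several metrics.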
23
4: Partitions Heuristic
• Ideally data for each task will fit into the RAM
available to that task.
• Cloudera has an excellent blog on Spark tuning:
(observed shuffle write) * (observed shuffle spill memory) * (spark.executor.cores)
/ ((observed shuffle spill disk) * (spark.executor.memory) * (spark.shuffle.memoryFraction) * (spark.shuffle.safetyFraction))
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
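The expression above reads as a fraction: the first line is the numerator and the second the denominator. As we read the Cloudera post, the result is a rough estimate of how many partitions a shuffle needs so that each task's data fits in memory. The sketch below evaluates it for made-up inputs; the function name estimate_partitions and the sample sizes are hypothetical, and the defaults of 0.2 and 0.8 are the legacy defaults of spark.shuffle.memoryFraction and spark.shuffle.safetyFraction, settings that belong to the pre-1.6 memory manager.

```python
import math


def estimate_partitions(shuffle_write, shuffle_spill_memory, shuffle_spill_disk,
                        executor_cores, executor_memory,
                        memory_fraction=0.2, safety_fraction=0.8):
    """Evaluate the partition estimate; sizes must share one unit (e.g. GB)."""
    numerator = shuffle_write * shuffle_spill_memory * executor_cores
    denominator = (shuffle_spill_disk * executor_memory
                   * memory_fraction * safety_fraction)
    if denominator <= 0:
        # No observed disk spill (or zero memory): the estimate is undefined.
        raise ValueError("formula needs positive spill and memory values")
    # Round up: too many partitions is usually safer than too few.
    return math.ceil(numerator / denominator)


if __name__ == "__main__":
    # Hypothetical observations, all in GB.
    print(estimate_partitions(shuffle_write=10, shuffle_spill_memory=30,
                              shuffle_spill_disk=6, executor_cores=4,
                              executor_memory=8))
```

With these inputs the ratio is 1200 / 7.68 = 156.25, so the sketch returns 157.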
24
More Heuristics?
Yes, please! Dr. Elephant is open source.
25
Pepperdata
• Capacity Optimizer
• Policy Enforcer
• Cluster Analyzer
26
Pepperdata
27
Pepperdata
28
Pepperdata
• Capacity Optimizer (mostly for operators)
• Policy Enforcer (mostly for operators)
• Cluster Analyzer (mostly for operators)
• Application Profiler (for developers)
29
Application Profiler
• Benefits to our users:
– Provide simple answers to simple questions
– Combination of metrics for experts, simple actionable
insights for users
– Pepperdata support
• Why stay close to open source?
– Heuristics
30
Application Profiler, Hardware and Cluster Weather
31
Application Profiler, Hardware and Cluster Weather
32
Thanks!
Stop by the Pepperdata booth (#101)
Come to the Dr. Elephant Meetup:
6:00 PM Wednesday, June 7, 2017
LinkedIn San Francisco Office
222 2nd Street, San Francisco
Get involved with Dr. Elephant:
https://github.com/linkedin/dr-elephant
Contact us:
simon@pepperdata.com, csteinbach@linkedin.com




