Performance tuning of Apache Spark
Melbourne Apache Spark meetup
Maksud Ibrahimov
February 2016
Who am I?
—  Chief Data Scientist at InfoReady - a leading Australian data & analytics business
—  PhD in Artificial Intelligence from the University of Adelaide
—  Over the last 10+ years, have worked on and improved the operations of major companies in mining, manufacturing, retail and logistics through machine learning, optimisation and simulation
—  Particular interest in applying these algorithms in cluster computing environments, hence Apache Spark
—  User of Spark since the 1.0 release
What is Apache Spark?
—  Apache Spark is a fast and general engine for large-scale data processing.
—  Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
—  Write applications quickly in Java, Scala, Python, R
—  SQL
—  Streaming
—  Graph processing
—  Machine learning
How easy is it to write programs in Spark?
—  Fairly easy to start. Within 1-2 days you can start writing simple programs on a single machine (a minimal sketch follows below)
—  Not too hard to deploy and run on a cluster with preconfigured deployment options, such as Amazon EMR or the Hortonworks distribution
—  Once you start writing programs that run on a cluster with a few nodes, you may notice that execution is not as fast as expected
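As a taste of how little code a first program needs, here is a minimal sketch of a self-contained Spark application in Scala that runs locally on a single machine; the application name and numbers are illustrative.

import org.apache.spark.{SparkConf, SparkContext}

object FirstSparkApp {
  def main(args: Array[String]): Unit = {
    // Run locally on a single machine, using all available cores
    val conf = new SparkConf().setAppName("FirstSparkApp").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Distribute a small collection and compute a simple aggregate in parallel
    val numbers = sc.parallelize(1 to 1000000)
    val sumOfSquares = numbers.map(x => x.toLong * x).reduce(_ + _)
    println(s"Sum of squares: $sumOfSquares")

    sc.stop()
  }
}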
Generally, performance can be improved by tuning the following areas
•  Partitioning. Do you take full advantage of Spark's parallel capabilities? Do you spill to disk?
•  Runtime configuration. Is your configuration tuned to your task?
•  Optimal code. Can you perform your computation more efficiently? Algorithm complexity analysis.
•  Cluster and hardware. What hardware and how many nodes do I need? How do I run jobs quicker while keeping costs down?
•  Persistence. Do you perform unnecessary recomputation by failing to cache RDDs?
•  Isolating bottlenecks. How do you find which resource is your bottleneck? Block-time analysis.
Key concepts to understand for performance tuning
—  Spark performance metrics
—  Memory model
—  Partitioning
—  DAG and shuffles
—  Persistence
Spark programs consist of jobs, stages and tasks
—  A Spark program runs as one or more jobs (each action triggers a job)
—  The DAG scheduler splits jobs into stages
—  Tasks belong to a stage. A task is a unit of work that runs on an executor and corresponds to a single partition
—  Each task either partitions its output for a “shuffle”, or sends the output back to the driver (see the sketch after the diagram below)
[Diagram: one job split into Stage 1 and Stage 2, each stage containing several tasks]
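To make the job/stage/task split concrete, here is a small hedged sketch (the file path and variable names are hypothetical) showing where the DAG scheduler introduces a stage boundary.

// Assumes an existing SparkContext `sc`
// Narrow transformations (flatMap, map) stay inside one stage
val pairs = sc.textFile("events.log")          // hypothetical input path
  .flatMap(_.split(" "))
  .map(word => (word, 1))

// reduceByKey needs a shuffle, so the DAG scheduler starts a new stage here
val counts = pairs.reduceByKey(_ + _)

// The action triggers one job; the Spark UI shows it as two stages,
// each containing one task per partition
counts.count()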
Shuffle anatomy
[Diagram: tasks 1-3 of Stage 1 perform a shuffle write to disk; tasks 1-3 of Stage 2 perform a shuffle read]
—  A shuffle redistributes data among partitions
—  Files are written to disk at the end of one stage
—  They are read by the next stage
—  Reducing the number of shuffles will generally improve performance (one way to do this is sketched below)
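One common way to shrink shuffles, as a hedged sketch rather than anything taken from these slides: prefer operators that combine values on the map side. The input path and column layout below are hypothetical.

// Assumes an existing SparkContext `sc`
val pairs = sc.textFile("sales.csv")                       // hypothetical input
  .map(_.split(","))
  .map(fields => (fields(0), fields(1).toDouble))          // (customer, amount)

// groupByKey ships every record across the network before summing
val totalsSlow = pairs.groupByKey().mapValues(_.sum)

// reduceByKey combines values within each partition first,
// so far less data is written and read during the shuffle
val totalsFast = pairs.reduceByKey(_ + _)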
Spark memory model
—  Execution memory: shuffles
—  Storage memory: caching
—  Before 1.6.0 you had to manually configure the memory ratios
—  1.6.0: unified memory management (a configuration sketch follows below)
[Diagram: memory regions for Storage, Execution and the File system]
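A minimal configuration sketch for the 1.6.0 unified memory manager; the values shown are the 1.6 defaults and are illustrative, not recommendations.

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("MemoryTuningExample")
  // Fraction of the heap (minus a reserved amount) shared by execution and storage
  .set("spark.memory.fraction", "0.75")
  // Portion of that shared region protected for cached (storage) blocks
  .set("spark.memory.storageFraction", "0.5")

// Before 1.6.0 the split was fixed instead via
// spark.shuffle.memoryFraction and spark.storage.memoryFraction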
How to debug performance?
—  The Web UI is your friend
—  Failed executors: JVM crashes, memory issues, config issues, network
—  Identify stragglers
•  Is a particular node running slow? Turn speculation on (see the sketch below)
•  Data skew: max >> median
•  GC issues
•  jstack, jmap, or the UI stack dump
—  Recomputation
•  rdd.toDebugString or the Web UI
—  Metrics to watch
•  GC time. Much of it is gone in Spark 1.6 thanks to Tungsten
•  Disk spill
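Two of these checks map directly onto the API. A hedged sketch, where `counts` stands in for any RDD in your job.

import org.apache.spark.SparkConf

// Relaunch suspiciously slow tasks on other nodes (helps when one node is the straggler)
val conf = new SparkConf()
  .set("spark.speculation", "true")
  // How much slower than the median a task must be before it is speculatively re-run
  .set("spark.speculation.multiplier", "1.5")

// Print an RDD's lineage to spot stages that are being recomputed unnecessarily
println(counts.toDebugString)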
Using the UI to find the cause of the skew
Find the problematic partition. The majority of such problems are related to disk I/O
Cause: rdd.persist(StorageLevel.MEMORY_AND_DISK)
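A hedged sketch of the persistence choices around that call; `features` is a placeholder for any RDD reused by several downstream jobs.

import org.apache.spark.storage.StorageLevel

// MEMORY_AND_DISK keeps what fits in memory and spills the rest to local disk,
// which is exactly the disk I/O that showed up in the UI above
features.persist(StorageLevel.MEMORY_AND_DISK)

// MEMORY_ONLY never spills, but silently drops partitions that do not fit and
// recomputes them; MEMORY_ONLY_SER trades CPU for a smaller memory footprint
// features.persist(StorageLevel.MEMORY_ONLY_SER)

// Release the cached blocks once the RDD is no longer needed
features.unpersist()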
The same RDD can be split differently
[Diagram: a 100 GB RDD split into 4 partitions of 25 GB each, versus the same 100 GB RDD split into 100 partitions of 1 GB each]
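In code, changing how the same data is split is a one-liner; `bigRdd` below is a hypothetical 100 GB RDD that was read with too few partitions.

println(bigRdd.partitions.length)      // e.g. 4 partitions of roughly 25 GB each

// repartition() performs a full shuffle to rebalance the same data
val rebalanced = bigRdd.repartition(100)
println(rebalanced.partitions.length)  // 100 partitions of roughly 1 GB each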
Spilling to disk
Small tasks vs Large tasks
[Diagram: with a pool of small tasks (about 1 second each), four cores process the tasks within execution memory; with a pool of large tasks (about 60 seconds each), each task outgrows its share of execution memory and spills to disk]
Partitions
—  Partitions determine the degree of parallelism
—  Partitions that are too big
•  A small number of tasks, each taking a long time to execute
•  More memory needed per task => disk spills
•  More chance of data skew
—  Partitions that are too small
•  The overhead of launching a task dominates its runtime
—  Rule of thumb: task runtime just under 1 s per task, and more than 100 ms
—  To control the number of partitions use coalesce() and repartition(); repartition() always triggers a shuffle and coalesce() may. In some cases an additional shuffle at the beginning of the job can improve performance (see the sketch below)
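A hedged sketch of the most common ways to control the partition count; the paths and numbers are illustrative only.

// Assumes an existing SparkContext `sc`

// Ask for more partitions at read time so the job starts with enough parallelism
val logs = sc.textFile("hdfs:///data/logs", minPartitions = 200)

// coalesce() merges partitions without a shuffle - useful for shrinking
// the partition count after a selective filter
val fewer = logs.filter(_.contains("ERROR")).coalesce(20)

// repartition() always shuffles and can also increase the partition count;
// paying for this extra shuffle up front sometimes speeds up the rest of the job
val more = logs.repartition(400)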
Partitions
[Chart: execution time versus number of partitions; execution time falls to a minimum and then rises again, with the optimal partition-size range around that minimum]
Partitioning strategies
—  Fixed number of partitions
•  numPartitions = 100
—  Fixed number of records per partition
•  numPartitions = rdd.count / 100
—  Fixed memory size of a partition. Calculate the number of partitions based on the memory consumption of the RDD, or of a sample of it (see the sketch below)
•  partitionSpace = 100 MB
•  rowsPerPartition = partitionSpace / meanRowSpace(rdd)
•  numPartitions = rdd.count / rowsPerPartition
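A hedged Scala sketch of the third strategy. The 100 MB target and the 1,000-row sample are illustrative, and the helper below is our own assumption rather than something from the slides; it leans on Spark's SizeEstimator to approximate the mean in-memory row size.

import org.apache.spark.rdd.RDD
import org.apache.spark.util.SizeEstimator

def numPartitionsFor(rdd: RDD[_], targetPartitionBytes: Long = 100L * 1024 * 1024): Int = {
  // Estimate the mean in-memory row size from a small sample
  val sample = rdd.take(1000)
  require(sample.nonEmpty, "cannot size an empty RDD")
  val meanRowBytes =
    sample.map(row => SizeEstimator.estimate(row.asInstanceOf[AnyRef])).sum.toDouble / sample.length
  val rowsPerPartition = math.max(1L, (targetPartitionBytes / meanRowBytes).toLong)
  math.max(1, math.ceil(rdd.count().toDouble / rowsPerPartition).toInt)
}

// Usage (assumes an existing RDD `events`):
// val balanced = events.repartition(numPartitionsFor(events))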
CPU vs Memory. What should I add to increase the performance of my job?
—  Parameters I can play with:
•  Number of cores per node
•  Amount of memory per node
•  Both are related to the number of nodes
•  Amount of storage. Normally not a problem
•  Network. Normally fixed
Where is your bottleneck?
[Diagram: the candidate bottlenecks - RAM, CPU, Network, Storage I/O]
Block-time analysis
Source: Kay Ousterhout, Spark Summit 2015
How job completion time would change if the network were infinitely fast
Source: Kay Ousterhout, Spark Summit 2015
How do we do it?
—  The goal is to get CPU utilisation on each of the nodes to 100%
—  Limit your resources to a single core and just 1 GB of memory (a configuration sketch follows below)
—  Run the job on a subset of the data, so the full job runs in about a minute. This way you can iterate through your tuning experiments much faster.
—  Tweak the memory and partition size until there are no disk spills. This is the amount of memory needed per core
—  Scale cores and memory proportionally
—  Make sure the partition size is the same as for the full job
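A hedged configuration sketch of this "start tiny, then scale" approach; the property values are illustrative, and spark.cores.max applies to standalone/Mesos deployments.

import org.apache.spark.SparkConf

val tuningConf = new SparkConf()
  .setAppName("TuningExperiment")
  .set("spark.executor.cores", "1")     // one core per executor
  .set("spark.cores.max", "1")          // cap the whole experiment at a single core
  .set("spark.executor.memory", "1g")   // 1 GB of executor memory

// Run on a sample so one full pass takes roughly a minute
// (assumes an existing RDD `fullData`):
// val sampled = fullData.sample(withReplacement = false, fraction = 0.01)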
Key lessons
—  Understand the memory model
—  Avoid expensive shuffles if possible
—  Choose number/size of partitions
—  Use persistence when reusing RDDs
—  Do not spill to disk
—  Make the job CPU bound and scale for performance
—  Experiment on small subsets and limited resources
Thank you!