Spark Tuning
Agenda
1. Tuning Spark parameters
   a. Controlling Spark's resource usage
   b. Advanced parameters
   c. Dynamic allocation
2. Tips for tuning your Spark program
3. Example use case of tuning a Spark algorithm
Tuning Spark Parameters
The easy way
Spark Architecture Simplified
Control Spark's resource usage
spark-submit command parameters (some only available when running on YARN)
Parameter | Description | Default value
Calculate the right values
For example: 4 servers for Spark, each with 64 GB RAM and 16 cores. How should we set the spark-submit parameters?
--num-executors 4 --executor-memory 63g --executor-cores 15
--num-executors 7 --executor-memory 29g --executor-cores 7
--num-executors 11 --executor-memory 19g --executor-cores 5
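The third option above can be derived mechanically. Here is a rough sketch of the arithmetic; the per-node reservations and the three-executors-per-node split are assumptions used for illustration, not Spark defaults:

```python
def size_executors(nodes, mem_gb, cores,
                   reserved_mem_gb=1, reserved_cores=1,
                   executors_per_node=3, overhead_fraction=0.10):
    """Rough executor sizing: leave some memory and cores per node for
    the OS and Hadoop daemons, split the rest among a fixed number of
    executors, deduct the ~10% YARN adds back as memoryOverhead, and
    keep one executor slot free for the YARN application master."""
    usable_mem = mem_gb - reserved_mem_gb          # per node
    usable_cores = cores - reserved_cores          # per node
    mem_per_executor = usable_mem / executors_per_node
    # the heap must leave room for spark.yarn.executor.memoryOverhead
    heap_gb = int(mem_per_executor / (1 + overhead_fraction))
    num_executors = nodes * executors_per_node - 1  # one slot for the AM
    exec_cores = usable_cores // executors_per_node
    return num_executors, heap_gb, exec_cores

print(size_executors(nodes=4, mem_gb=64, cores=16))  # (11, 19, 5)
```

This reproduces the third configuration above: 11 executors, 19 GB heap, 5 cores each.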
Spark Executors Memory Model
Memory requested from YARN for each container =
spark.executor.memory + spark.yarn.executor.memoryOverhead
spark.yarn.executor.memoryOverhead = max(spark.executor.memory * 0.1, 384 MB)
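The formula above can be checked with a few lines (MB units assumed throughout):

```python
def yarn_container_mb(executor_memory_mb, overhead_fraction=0.10,
                      min_overhead_mb=384):
    """Total memory YARN must grant per executor container:
    spark.executor.memory plus spark.yarn.executor.memoryOverhead,
    where the overhead is max(10% of the heap, 384 MB)."""
    overhead = max(int(executor_memory_mb * overhead_fraction),
                   min_overhead_mb)
    return executor_memory_mb + overhead

# A 19 GB executor actually asks YARN for roughly 20.9 GB:
print(yarn_container_mb(19 * 1024))   # 21401
# Small executors are floored at the 384 MB minimum overhead:
print(yarn_container_mb(2 * 1024))    # 2432
```

This is why the earlier sizing example requests 19 GB heaps rather than 21 GB: the extra ~10% overhead must still fit inside the node's memory.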
More advanced parameters
spark.shuffle.memoryFraction: Fraction of Java heap to use for aggregation and cogroups during shuffles (default: 0.2)
spark.shuffle.file.buffer: Size of the in-memory buffer for each shuffle file output stream (default: 32k)
spark.storage.memoryFraction: Fraction of Java heap to use for Spark's memory cache (default: 0.6)
Advanced Spark memory
Demo Spark UI
Using Dynamic Allocation
Dynamically scales the set of cluster resources allocated to your application up and down based on the workload
Only available when using YARN as the cluster manager
Requires an external shuffle service, so a shuffle service must be configured with YARN
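Putting those requirements together, a submission enabling dynamic allocation might look like the sketch below. The min/max values and the application name are placeholders; note that the external shuffle service must also be registered as an auxiliary service on the YARN NodeManagers, which is a YARN-side configuration step:

```shell
spark-submit \
  --master yarn \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=2 \
  --conf spark.dynamicAllocation.maxExecutors=20 \
  my_app.jar
```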
Dynamic Allocation parameters (1)
spark.shuffle.service.enabled: Enables the external shuffle service. This service preserves the shuffle files written by executors so the executors can be safely removed (default: false)
Dynamic Allocation parameters (2)
spark.dynamicAllocation.maxExecutors: Upper bound for the number of executors (default: infinity)
Dynamic Allocation in Action
Dynamic Allocation - The verdict
Dynamic allocation helps you use your cluster resources more efficiently
But it is only effective when the Spark application is a long-running one with long stages that need different numbers of tasks (Spark Streaming?)
In addition, when an executor is removed, all of its cached data is no longer accessible
Tips for Tuning Your Spark Program
Tuning Memory Usage
Prefer arrays of objects and primitive types over the standard Java or Scala collection classes (e.g. HashMap).
Avoid nested structures with many small objects and pointers when possible.
Use numeric IDs or enumeration objects instead of strings for keys.
If you have less than 32 GB of RAM, set the JVM flag -XX:+UseCompressedOops to make pointers four bytes instead of eight.
Other Tuning Tips (1)
Use KryoSerializer instead of the default JavaSerializer
Know when to persist an RDD and choose the right storage level
  MEMORY_ONLY
  MEMORY_AND_DISK
  MEMORY_ONLY_SER
Choose the right level of parallelism
  spark.default.parallelism
  repartition
  the second argument to methods in spark.PairRDDFunctions
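A sketch of how the serializer and parallelism settings might appear in spark-defaults.conf; the parallelism value here is purely illustrative (a common rule of thumb is 2-3 tasks per CPU core in the cluster):

```
spark.serializer           org.apache.spark.serializer.KryoSerializer
# assuming ~55 executor cores, 2x tasks per core:
spark.default.parallelism  110
```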
Other Tuning Tips (2)
Broadcast large variables
Do not collect() on large RDDs (filter first)
Be careful with operations that require a data shuffle (join, reduceByKey, groupByKey)
Avoid groupByKey; use reduceByKey, aggregateByKey, or combineByKey (low level) if possible.
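The groupByKey vs reduceByKey difference can be sketched without Spark: reduceByKey pre-combines values per key inside each partition before the shuffle, so far fewer records cross the network, while groupByKey ships every record. A toy simulation in plain Python, counting "shuffled" records:

```python
from collections import defaultdict

# two "map-side" partitions of (key, value) pairs
partitions = [
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("a", 1), ("b", 1)],
]

# groupByKey: every (key, value) pair crosses the network
shuffled_group = sum(len(p) for p in partitions)

def combine(partition):
    """Map-side combine, as reduceByKey does: fold values per key
    within the partition, so at most one record per key is shuffled."""
    acc = defaultdict(int)
    for k, v in partition:
        acc[k] += v
    return list(acc.items())

shuffled_reduce = sum(len(combine(p)) for p in partitions)

print(shuffled_group, shuffled_reduce)  # 7 4
```

Here groupByKey would shuffle all 7 records, while reduceByKey shuffles only 4; the gap grows with the number of duplicate keys per partition.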
groupByKey vs reduceByKey (1)
groupByKey vs reduceByKey (2)
Example use case of tuning a Spark algorithm
Tuning the CF algorithm in the RW project
1st algorithm, no parameter tuning: 27 min
1st algorithm, parameters tuned: 18 min
2nd algorithm (from Spark code), parameters tuned: ~7 min 30 s
3rd algorithm (improved Spark code), parameters tuned: ~6 min 30 s
Q&A
Thank You!