Fine Tuning and Enhancing Performance of Apache Spark Jobs
Blake Becerra, Kira Lindke, Kaushik Tadikonda
Our Setup
▪ Data Validation Tool for ETL
▪ Millions of comparisons and aggregations
▪ One of the larger datasets initially took 4+ hours to run and was unstable
▪ Challenge: improve reliability and performance
▪ After months of research and tuning, the same application now takes 35 minutes
Configuring Cluster
▪ Test changes with
▪ spark.driver.cores
▪ spark.driver.memory
▪ executor-memory
▪ executor-cores
▪ spark.cores.max
▪ Reserve cores (~1 core per node) for daemons
▪ num_executors = total_cores / num_cores
▪ num_partitions
▪ Too much memory per executor can result in excessive GC delays
▪ Too little memory loses the benefits of running multiple tasks in a single JVM
▪ Look at stats (network, CPU, memory, etc.) and tweak to improve performance
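A minimal sizing sketch under assumed hardware (a hypothetical 10-node cluster with 16 cores and 64 GB RAM per node); the numbers are illustrative, not the configuration used in this talk:
import org.apache.spark.sql.SparkSession
// 10 nodes x 16 cores = 160 cores; reserve ~1 core per node for daemons -> 150 usable.
// With 5 cores per executor: 150 / 5 = 30 executors (~3 per node).
val spark = SparkSession.builder()
  .appName("cluster-sizing-sketch")
  .config("spark.driver.cores", "4")
  .config("spark.driver.memory", "8g")
  .config("spark.executor.cores", "5")
  .config("spark.executor.memory", "18g")          // leave headroom for OS/overhead on each node
  .config("spark.executor.instances", "30")
  .config("spark.sql.shuffle.partitions", "300")   // num_partitions: roughly 2-3x total cores
  .getOrCreate()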
Skew
▪ Can severely degrade performance
▪ Extreme imbalance of work in the cluster
▪ Tasks within a stage (like a join) take uneven amounts of time to finish
▪ How to check
▪ Spark UI shows the job waiting on only some of the tasks
▪ Look for large variances in memory usage within a job (primarily works at the beginning of the job during data ingestion – otherwise it can be misleading)
▪ Executors missing heartbeats
▪ Check partition sizes of the RDD while debugging to confirm (see the sketch below)
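A small debugging sketch for that last check, assuming an existing RDD named rdd (for a DataFrame, use df.rdd); it counts records per partition so skew shows up directly:
import org.apache.spark.rdd.RDD
def partitionSizes[T](rdd: RDD[T]): Array[(Int, Int)] =
  rdd.mapPartitionsWithIndex((idx, iter) => Iterator((idx, iter.size))).collect()
// Print the ten largest partitions; a handful of huge partitions confirms skew.
partitionSizes(rdd).sortBy(-_._2).take(10).foreach { case (idx, n) =>
  println(s"partition $idx -> $n records")
}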
Example – skew in ingestion of data
(Spark UI screenshots: uneven partitions vs. even partitions)
Handling skew cont’d
▪ Blind repartitioning is the most naïve approach, but it is effective (see the sketch below)
▪ Great for narrow transformations
▪ Good for increasing the number of partitions
▪ Use coalesce, not repartition, to decrease the number of partitions
▪ A better approach specifically for joins is covered later
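A minimal sketch of the two operations, assuming an existing DataFrame named df; the partition counts are illustrative:
val spreadOut = df.repartition(400)     // increase partitions: full shuffle, evens out skew
val narrowed  = spreadOut.coalesce(50)  // decrease partitions: merges locally, avoids a shuffle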
Handling skew - ingestion
▪ Use Spark options to do partitioned reads with JDBC
▪ partitionColumn
▪ lower/upperBound – used to determine the stride
▪ numPartitions – maximum number of partitions
▪ partitionColumn
▪ Ideally not skewed (such as a primary key)
▪ The stride itself can still have skew
▪ Possible trick – a mod function
▪ Example of working with a slow JDBC database:
▪ Initial query took ~40 minutes
▪ Partitioned reads took it down to 10 minutes
Example
lowerBound: 0
upperBound: 1000
numPartitions: 10
=> Stride is equal to 100 and partitions correspond to the following queries:
SELECT * FROM tableName WHERE partitionModColumn < 100
SELECT * FROM tableName WHERE partitionModColumn >= 100 AND partitionModColumn < 200
...
SELECT * FROM tableName WHERE partitionModColumn >= 900
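A sketch of the corresponding read, assuming a hypothetical JDBC URL and a helper column partitionModColumn (e.g. primary_key % 1000) that already exists in the source table:
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/mydb")   // hypothetical connection details
  .option("dbtable", "tableName")
  .option("user", "dbUser")
  .option("password", "dbPassword")
  .option("partitionColumn", "partitionModColumn")
  .option("lowerBound", "0")
  .option("upperBound", "1000")
  .option("numPartitions", "10")    // stride = (1000 - 0) / 10 = 100
  .load()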
Also works with partitioned data ingestion
Example – reading from object storage, files, etc.
▪ Data reads in with the same skew it has on disk
▪ Read in, then repartition to get an even distribution
Cache/Persist
▪ Reuse a DataFrame across multiple transformations/actions
▪ Unpersist when done
▪ Without cache, the DataFrame is rebuilt from scratch each time it is used
▪ Don't over-persist
▪ Worse memory performance
▪ Possible slowdown
▪ GC pressure
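A minimal sketch, assuming an existing DataFrame named source with "status" and "day" columns (illustrative names):
import org.apache.spark.sql.functions.col
val prepared = source.filter(col("status") === "ACTIVE").cache()
val total = prepared.count()                    // first action materializes the cache
prepared.groupBy(col("day")).count().show()     // reuses cached data instead of recomputing
prepared.unpersist()                            // release memory once finished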
Other Performance Improvements
▪ Try seq.par.foreach instead of just seq.foreach (see the sketch below)
▪ Increases parallelization
▪ Watch for race conditions and non-deterministic results
▪ Use accumulators or synchronization to protect shared state
▪ Avoid UDFs if possible
▪ Spark must deserialize every row to an object
▪ Apply the lambda
▪ Then reserialize it
▪ More garbage generated
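A sketch of driver-side parallel job submission, assuming tableNames is a Seq[String] of hypothetical input names and spark is the active session (.par needs the scala-parallel-collections module on Scala 2.13+):
val rowsProcessed = spark.sparkContext.longAccumulator("rowsProcessed")
tableNames.par.foreach { table =>
  // Each iteration submits an independent Spark job; parallel threads let the
  // scheduler overlap them instead of running them strictly one after another.
  val df = spark.read.parquet(s"/data/$table")   // hypothetical path
  rowsProcessed.add(df.count())                  // accumulator is safe to update from many threads
}
println(s"total rows validated: ${rowsProcessed.value}")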
Join Optimization
▪ Different join strategies
▪ Data preprocessing – clever filtering
▪ Avoiding unnecessary scans
▪ Locality – use of the same partitioner
▪ Salting – skewed join keys
Filter Trick – included in Spark 3.0
Salting – Reduce Skew
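A salting sketch for the skewed-join case, assuming a large skewed DataFrame facts and a small DataFrame dims joined on "key"; the names and the salt range (0–9) are illustrative:
import org.apache.spark.sql.functions._
val saltBuckets = 10
// Skewed side: add a random salt so a hot key is split across several partitions.
val saltedFacts = facts.withColumn("salt", (rand() * saltBuckets).cast("int"))
// Small side: replicate each row once per salt value so every (key, salt) pair matches.
val saltedDims = dims.withColumn("salt", explode(array((0 until saltBuckets).map(lit): _*)))
// Join on key + salt, then drop the helper column.
val joined = saltedFacts.join(saltedDims, Seq("key", "salt")).drop("salt")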
Things to remember
▪ Follow good partitioning strategies
▪ Too few partitions – less parallelism
▪ Too many partitions – scheduling issues
▪ Improve scan efficiency
▪ Try to use the same partitioner between DataFrames for a join (see the sketch below)
▪ Skipped stages are good
▪ Caching prevents repeated exchanges where data is reused
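A sketch of pre-partitioning both sides of a join on the same key (df1, df2, and the column "id" are illustrative); when both inputs already share the required distribution, the planner can skip an extra exchange, and caching the partitioned result avoids repeating it:
import org.apache.spark.sql.functions.col
val left  = df1.repartition(200, col("id")).cache()
val right = df2.repartition(200, col("id")).cache()
// Both sides now use the same hash partitioning on "id"; joins on "id" can reuse
// that distribution instead of shuffling the data again.
val joined = left.join(right, "id")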
Fair Scheduling
▪ Jobs are scheduled in round-robin fashion
▪ Multiple jobs can run simultaneously if submitted by threads inside an application
▪ Better resource utilization
▪ Harder to debug and makes performance tuning more difficult
Fair Scheduling and Pools
▪ Enabled by setting spark.scheduler.mode to FAIR
▪ Allows jobs to be grouped into pools with prioritization options
▪ Jobs created from threads without a pool defined always go to the “default” pool
▪ Pools, weight, and minShare are specified in fairscheduler.xml
(diagrams: FAIR vs. FIFO scheduling)
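A sketch of wiring this up, with illustrative pool names, weights, and file path; the XML shown in comments is a plausible fairscheduler.xml, not the one from this talk:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .appName("fair-scheduling-sketch")
  .config("spark.scheduler.mode", "FAIR")
  .config("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")  // hypothetical path
  .getOrCreate()
// fairscheduler.xml (sketch):
// <allocations>
//   <pool name="critical"><schedulingMode>FAIR</schedulingMode><weight>4</weight><minShare>2</minShare></pool>
//   <pool name="batch"><schedulingMode>FAIR</schedulingMode><weight>1</weight><minShare>0</minShare></pool>
// </allocations>
// Jobs submitted from this thread run in the "critical" pool; threads that never
// set the property fall back to the "default" pool.
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "critical")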
Serialization
Java Serializer
▪ Default for most types
▪ Can work with any class
▪ More flexible
▪ Slower
Kryo Serializer
▪ Default for shuffling RDDs with simple types
▪ Significantly faster and more compact
▪ Set your SparkConf to use "org.apache.spark.serializer.KryoSerializer"
▪ Register classes in order to take full advantage
https://spark.apache.org/docs/latest/tuning.html#data-serialization
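A minimal Kryo setup sketch following the tuning guide above; MyRecord and MyKey are placeholder application classes defined here only for illustration:
import org.apache.spark.SparkConf
case class MyRecord(id: Long, name: String)
case class MyKey(id: Long)
val conf = new SparkConf()
  .setAppName("kryo-sketch")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[MyRecord], classOf[MyKey]))
  .set("spark.kryo.registrationRequired", "true")  // optional: fail fast on unregistered classes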
CompressedOops
▪ The JVM flag -XX:+UseCompressedOops allows usage of 4-byte pointers instead of 8-byte
▪ Does not work as heap size grows beyond 32 GB
▪ Without compressed pointers, you need a ~48 GB heap to hold the same number of objects as a 32 GB heap
Garbage Collection Tuning
▪ Contention when allocating more objects
▪ Full GC occurring before tasks finish
▪ Too many RDDs cached
▪ Frequent garbage collection
▪ Long time spent in GC
▪ Missed heartbeats
Garbage Collection Tuning
Enable GC Logging
ParallelGC:
-XX:+UseParallelGC
-verbose:gc
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
G1GC:
-XX:+UseG1GC
-verbose:gc
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+UnlockDiagnosticVMOptions
-XX:+PrintAdaptiveSizePolicy
-XX:+G1SummarizeConcMark
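A sketch of passing these flags to executors (the ParallelGC variant is shown; the flag names are the Java 8-era options listed above):
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .appName("gc-logging-sketch")
  .config("spark.executor.extraJavaOptions",
    "-XX:+UseParallelGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
  .getOrCreate()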
ParallelGC (default)
Heap space is divided into Young and Old
▪ Minor GC occurs frequently:
▪ Increase Eden and Survivor space
▪ Major GC occurs frequently:
▪ Increase young space if too many objects are being promoted
▪ Increase old space
▪ Decrease spark.memory.fraction
▪ Full GC occurs before task finishes:
▪ Try clearing the young space by triggering GC more frequently
▪ Increase memory
G1GC
▪ Breaks the heap into thousands of regions
▪ Recommended if heap sizes are large (> 8 GB) and GC times are long
▪ Lower latency, traded for higher CPU usage for GC bookkeeping
For large-scale applications, these properties help prevent full collections:
▪ -XX:ParallelGCThreads = n
▪ -XX:ConcGCThreads = [n, 2n]
▪ -XX:InitiatingHeapOccupancyPercent = 35
-XX:+UseG1GC
-XX:ParallelGCThreads=8
-XX:ConcGCThreads=16
-XX:InitiatingHeapOccupancyPercent=35
Takeaways
▪ Performance tuning is iterative
▪ Tuning is case by case
▪ Take advantage of the Spark UI, logs, and available monitoring
▪ Focus on the major slowdowns, not on one particular trick
▪ You can't be perfect
Thank you for your time!
Questions