Optimizations in
Apache Spark
Presented By: Sarfaraz Hussain
Software Consultant
Knoldus Inc.
About Knoldus
Knoldus is a technology consulting firm with a focus on modernizing digital systems at the pace your business demands.
DevOps
Functional. Reactive. Cloud Native
Agenda
01 Spark Execution Model
02 Optimizing Shuffle Operations
03 Optimizing Functions
04 SQL vs RDD
05 Logical & Physical Plan
06 Optimizing Joins
RDD: a distributed collection of records (e.g. Apple, Banana, Orange, Cat, Dog, Cow) split across partitions.
Optimizations in Spark; RDD, DataFrame
● Two kinds of operations:
1. Transformation
2. Action
● Dependencies are divided into two types:
1. Narrow Dependency
2. Wide Dependency
● Stages
Spark Execution Model
DAG
Stage Details
Narrow Transformations: map, mapValues, flatMap, filter, mapPartitions
Wide Transformations: cogroup, groupWith, join, leftOuterJoin, rightOuterJoin, groupByKey, reduceByKey, combineByKey, distinct, intersection, repartition, coalesce
Shuffle Operations
What is Shuffle?
- Shuffles are data transfers between different executors of a Spark cluster.
Shuffle Operations
1. To which executors does the data need to be sent?
2. How to send the data?
GroupByKey
Shuffle Operations
Where to send data?
- Partitioner - The partitioner defines how records are distributed across partitions, and thus which records are processed by each task.
Partitioner
Types of partitioner:
- Hash Partitioner: Uses Java’s Object.hashCode method to determine the partition as:
partition = key.hashCode() % numPartitions.
- Range Partitioner: It partitions data based on a set of sorted ranges of keys; tuples whose keys fall in the same range end up on the same machine. This method is suitable when there is a natural ordering in the keys and the keys are non-negative.
Example:
Hash Partitioner - GroupByKey, ReduceByKey
Range Partitioner - SortByKey
Further reading: https://www.edureka.co/blog/demystifying-partitioning-in-spark
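Both schemes can be sketched in plain Python (a conceptual emulation, not Spark's actual implementation; the sample keys and partition counts below are made up for illustration):

```python
# Conceptual emulation of Spark's two built-in partitioners (not Spark code).

def hash_partition(key, num_partitions):
    # HashPartitioner: partition = key.hashCode() % numPartitions
    return hash(key) % num_partitions

def range_partition(key, boundaries):
    # RangePartitioner: keys are assigned to partitions by sorted range
    # boundaries; keys in the same range land in the same partition.
    for i, upper in enumerate(boundaries):
        if key <= upper:
            return i
    return len(boundaries)

keys = [3, 15, 27, 8, 42]
print([hash_partition(k, 4) for k in keys])
# Three ranges: <=10, <=30, >30
print([range_partition(k, boundaries=[10, 30]) for k in keys])
```

Note that the range partitioner keeps nearby keys together, which is why SortByKey uses it, while the hash partitioner scatters keys uniformly, which is what GroupByKey and ReduceByKey want.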
Co-partitioned RDD
RDDs are co-partitioned if they are partitioned by the same partitioner.
Co-located RDD
Partitions are co-located if they are both loaded into the memory of the same machine
(executor).
Shuffle Operations
How to send data?
- Serialization - The mechanism of representing an object as a stream of bytes, transferring it over the network, and then reconstructing the same object and its state on another machine.
Serializer in Spark
- Types of Serializer in Spark -
- Java: slow, but robust
- Kryo: fast, but has a few limitations (it does not support all serializable types, and custom classes should be registered for best performance)
Further Reading: https://spark.apache.org/docs/latest/tuning.html#data-serialization
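In Python terms, the round trip corresponds to what the standard `pickle` module does (shown only to illustrate serialize/deserialize; Spark itself uses Java or Kryo serialization on the JVM):

```python
import pickle

# Serialize an object to a byte stream, then reconstruct it --
# conceptually what happens when Spark ships records between executors.
record = {"key": "apple", "count": 3}
payload = pickle.dumps(record)    # object -> bytes (would cross the network)
restored = pickle.loads(payload)  # bytes -> an equal object on the other side

print(restored == record)  # True: same state, reconstructed elsewhere
```

The cost of this round trip on every shuffled record is exactly why the choice of serializer (Java vs. Kryo) matters for performance.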
Optimizing Functions In Transformation
Optimizing Functions
map vs mapPartitions
- map applies the supplied function at the per-element level, while mapPartitions applies it at the partition level.
- map: Applies a transformation function to each item of the RDD and returns the result as a new RDD.
- mapPartitions: The function is called only once per partition. The entire content of the partition is available as a sequential stream of values via the input argument (Iterator[T]).
- https://stackoverflow.com/questions/21185092/apache-spark-map-vs-mappartitions
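The practical difference shows up when the function needs expensive per-invocation setup. A plain-Python emulation (conceptual sketch, not the PySpark API; the setup counter stands in for e.g. opening a database connection):

```python
# map: the function's setup runs once per element.
# mapPartitions: the setup runs once per partition, reused for all its records.

setup_calls = 0

def expensive_setup():
    global setup_calls
    setup_calls += 1
    return lambda x: x * 2        # stand-in for a transformation needing setup

def map_style(partitions):
    # per-element: setup happens for every single record
    return [[expensive_setup()(x) for x in part] for part in partitions]

def map_partitions_style(partitions):
    # per-partition: setup happens once, then is reused across the iterator
    out = []
    for part in partitions:
        f = expensive_setup()
        out.append([f(x) for x in part])
    return out

partitions = [[1, 2, 3], [4, 5]]
map_style(partitions)             # 5 setups, one per element
before = setup_calls
map_partitions_style(partitions)  # 2 setups, one per partition
print(setup_calls - before)       # 2
```

This is why mapPartitions is the idiomatic choice when the work involves connections, buffers, or any other per-task initialization.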
SQL vs RDD
SQL RDD
SQL is a high-level API. RDD is a low-level API.
SQL focuses on "WHAT". RDD focuses on "HOW".
Spark takes care of optimizing most SQL queries. Optimizing RDDs is the developer's responsibility.
SQL is declarative. RDDs are imperative, i.e. we must specify each step of the computation.
SQL knows about your data. RDDs know nothing about your data.
SQL does not involve much serialization/deserialization, as the Catalyst Optimizer takes care of optimizing it. RDDs involve heavy serialization/deserialization.
SQL
RDD
Logical & Physical Plan
● Logical Plan
- Unresolved Logical Plan OR Parsed Logical Plan
- Resolved Logical Plan OR Logical Plan OR Analyzed Logical Plan
- Optimized Logical Plan
● Catalog
● Catalyst Optimizer
● Tungsten
● Physical Plan
Logical & Physical Plan
https://blog.knoldus.com/understanding-sparks-logical-and-physical-plan-in-laymans-term/
Catalyst Optimizer and Tungsten
Codegen
Once the best physical plan is selected, it is time to generate the executable
code (a DAG of RDDs) for the query, to be executed on the cluster in a distributed
fashion. This process is called Codegen, and it is the job of Spark's Tungsten
Execution Engine.
Let’s see them in action!
Unresolved Logical Plan
Resolved Logical Plan
Optimized Logical Plan
Physical Plan
Optimizing Joins
Types of Joins -
a. Shuffle hash Join
b. Sort-merge Join
c. Broadcast Join
Shuffle hash Join
- Used when join keys are not sortable.
- Used when Sort-merge Join is disabled, i.e. spark.sql.join.preferSortMergeJoin is false.
- One side is much smaller (at least 3 times) than the other.
- A hash map can be built from the smaller side.
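The core idea — build a hash map from the smaller side, then probe it with the larger side — can be sketched in plain Python (a single-partition emulation; real Spark first shuffles both sides by join key so matching keys land on the same executor):

```python
def shuffle_hash_join(small, large):
    """Build a hash map from the smaller relation, then probe it with
    the larger one. Each relation is a list of (key, value) pairs."""
    table = {}
    for k, v in small:
        table.setdefault(k, []).append(v)
    # Probe: one dict lookup per record of the large side.
    return [(k, sv, lv) for k, lv in large for sv in table.get(k, [])]

users = [(1, "alice"), (2, "bob")]                   # smaller side
events = [(1, "login"), (1, "click"), (3, "login")]  # larger side
print(shuffle_hash_join(users, events))
```

The "at least 3 times smaller" condition above exists because the whole hash map for one partition of the small side must fit in a task's memory.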
Sort-merge Join
- spark.sql.join.preferSortMergeJoin is true by default.
- Default Join implementation.
- Join keys must be sortable.
- In our previous example, Sort-merge Join took place.
- Use bucketing: pre-shuffle and sort the data based on the join key.
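For comparison, a minimal sort-merge join in plain Python (conceptual only; Spark shuffles, sorts each side by key, then merges with two cursors):

```python
def sort_merge_join(left, right):
    """Join two lists of (key, value) pairs by sorting both sides
    and advancing two pointers through matching key runs."""
    left, right = sorted(left), sorted(right)
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Emit all right-side matches for this key, then, if the next
            # left key is the same, rewind to re-scan the right-side run.
            j0 = j
            while j < len(right) and right[j][0] == lk:
                out.append((lk, left[i][1], right[j][1]))
                j += 1
            i += 1
            if i < len(left) and left[i][0] == lk:
                j = j0
    return out

print(sort_merge_join([(1, "a"), (2, "b")], [(1, "x"), (2, "y"), (2, "z")]))
```

The merge itself is a single linear pass, which is why this join scales well once both sides are sorted, and why pre-sorted (bucketed) inputs make it cheap.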
Bucketing
- Bucketing pre-computes the shuffle and stores the data as a bucketed input table, thus avoiding a shuffle at each stage.
- SET spark.sql.sources.bucketing.enabled = TRUE
Broadcast Join
- Broadcast the smaller DataFrame to all worker nodes.
- Perform a map-side join.
- No shuffle operations take place.
- spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10485760)
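Conceptually, the small side is shipped whole to every partition of the large side, and each partition joins locally with no shuffle (plain-Python sketch; the 10485760 threshold above is Spark's default 10 MB size limit for the broadcast side):

```python
def broadcast_join(small, large_partitions):
    """Ship the small relation (as a dict) to every partition of the
    large relation and join locally -- the large side never moves."""
    broadcast = dict(small)  # this lookup table is copied to every worker
    return [
        [(k, broadcast[k], v) for k, v in part if k in broadcast]
        for part in large_partitions
    ]

countries = [("IN", "India"), ("US", "United States")]           # small dimension table
orders = [[("IN", 100), ("US", 250)], [("IN", 75), ("BR", 30)]]  # partitioned fact data
print(broadcast_join(countries, orders))
```

Note each output partition is produced independently, which is exactly what makes this a map-side join: no data from the large side crosses the network.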
Caching/Persisting
a. It keeps the lineage intact.
b. Data is cached in the executors' memory and is fetched from the cache on reuse.
c. If cached partitions are lost, they can be recomputed from scratch using the lineage.
d. Subsequent uses of an RDD do not trigger recomputation beyond the point where it is cached.
e. The cache is cleared after the SparkContext is destroyed.
f. Persisting is unreliable: cached data can be evicted under memory pressure.
g. data.persist() OR data.cache()
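The effect of caching — recomputation stops at the cached point — can be illustrated with a small memoizing wrapper in plain Python (conceptual only; Spark's cache lives in executor memory and is managed per partition):

```python
compute_calls = 0

def expensive_transform(data):
    global compute_calls
    compute_calls += 1           # counts how often the lineage is recomputed
    return [x * x for x in data]

class Cached:
    """Minimal stand-in for rdd.cache(): compute once, serve from memory after."""
    def __init__(self, fn, data):
        self.fn, self.data, self._cache = fn, data, None
    def collect(self):
        if self._cache is None:            # first action materializes the data
            self._cache = self.fn(self.data)
        return self._cache                 # later actions read the cache

rdd = Cached(expensive_transform, [1, 2, 3])
rdd.collect()
rdd.collect()                    # served from cache, no recomputation
print(compute_calls)             # 1
```

Without the wrapper, each action would re-run the whole transformation chain, which is precisely the cost caching is meant to avoid.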
Checkpointing
a. It breaks the lineage.
b. Data is written to and fetched from HDFS or the local file system.
c. Data cannot be recomputed from scratch if some partitions are lost, as the lineage chain is discarded.
d. Checkpointed data can be used in subsequent job runs.
e. Checkpointed data is persistent and is not removed after the SparkContext is destroyed.
f. Checkpointing is reliable.
Checkpointing
spark.sparkContext.setCheckpointDir("/hdfs_directory/")
myRdd.checkpoint()
df.rdd.checkpoint()
Why to make a checkpoint?
- Busy cluster.
- Expensive and long computations.
Thank You!
https://www.linkedin.com/in/sarfaraz-hussain-8123b4132/
sarfaraz.hussain@knoldus.com
