DE Bootcamp _ Week 3 Day 2
DEVELOPMENT ENVIRONMENTS
Data distribution
- Partitioned: maintains data partitions across the cluster.
- Broadcast: copies the entire small dataset to all executors in the cluster.
Memory management
- Partitioned: data is distributed across all nodes (memory efficient); handles large datasets since data is partitioned.
- Broadcast: highly efficient for small-large table joins (no shuffle); broadcast threshold defaults to 10 MB (can be changed manually, or the dataset can be wrapped in broadcast()).
Use cases
- Partitioned: large datasets that need multiple transformations; partitioned data that needs to remain partitioned; iterative tasks & repeated queries on the same dataset.
- Broadcast: joining a large dataset with a small lookup table; the smaller dataset fits into executor memory (<2 GB); network bandwidth is not a bottleneck for broadcasting; performs joins without shuffling.
[Diagram: broadcast join. The small dataset is copied to every partition (1 ... n), joined locally against each one, and the results form the combined dataset.]
UDFs / UDAFs
UDF (User Defined Function): a way to implement custom column transformations & complex data processing logic
within Spark DataFrame operations
UDAF (User Defined Aggregation Function): more specialized than regular UDFs, designed specifically for aggregation
operations.
Zach’s recommendations
Recent improvements:
- Apache Arrow integration improved PySpark UDF performance.
- Better alignment between PySpark & Scala Spark UDFs through optimization.
Dataset API benefits:
- Enables pure Scala functional programming.
- Eliminates the need for traditional UDFs.
- Provides type-safe operations.
- Better performance characteristics.
Primary recommendation:
- PySpark over Scala (more job opportunities).
- Scala if coming from a Java background.
Language transition considerations:
- Java to Scala: easier transition (both JVM languages).
- Python to Scala: tough transition (interpreted vs. compiled languages).
Performance optimization tips:
- Use the Dataset API in Scala when possible.
- Leverage Apache Arrow optimizations in PySpark.
- Consider UDAFs for aggregation operations in Scala.
Spark Tuning
Parquet: columnar file format with run-length encoding; benefits from parallel processing and more efficient querying.
Run-length encoding benefits:
- Leverages data recency & primary patterns.
- Reduces cloud storage costs & network traffic, better compression ratios, eliminates cross-partition data movement.
Sorting recommendations:
- DO: .sortWithinPartitions() (parallelizable operation, maintains good data distribution, sorts locally, cost-effective, no expensive shuffle operations).
- DON'T: a global .sort() (very slow performance; resource intensive; creates an additional shuffle step; all data forced to one single executor; requires strict order maintenance; computationally expensive).
Technical implementation:
1. Partition-level processing
2. Performance optimization
3. Storage efficiency
Memory Strategy:
1. Executor: don't arbitrarily set it to 16 GB (configure based on workload requirements).
2. Driver: bump it only when needed.
Shuffle Partitions:
1. Default: 200 partitions (~100 MB/partition).
2. Optimized: test with multiple partition counts (1000, 2000, 3000, ...), measure performance metrics, and incrementally adjust parameters. Check the job type:
   - I/O-heavy job? Focus on partition optimization.
   - Memory-heavy job? Balance executor memory.
   - Network-heavy job? Prioritize network topology & data locality.
AQE (Adaptive Query Execution):
- Optimizes for skewed datasets.
- Automatically adjusts query plans based on runtime stats.
- Don't enable it preemptively (let Spark handle skew optimization).