DE Bootcamp _ Week 3 Day 2

The document outlines advanced concepts in Apache Spark, focusing on development environments, caching, temporary views, and user-defined functions (UDFs). It compares different approaches for job submission and caching strategies, emphasizing the importance of memory management and performance optimization. Additionally, it discusses the differences between PySpark and ScalaSpark, as well as tuning recommendations for Spark applications.


ZACH MORRIS’ DATA ENGINEERING BOOTCAMP WEEK 3 DAY 2

APACHE SPARK: ADVANCED CONCEPTS

DEVELOPMENT ENVIRONMENTS

SPARK SERVER APPROACH

Characteristics:
- Job submission via CLI with a Java class containing a run method.
- Fresh environment for each run (automatic uncaching).
- Similar to production environment.

Technical considerations:
- More accurate performance testing.
- Better resource cleanup.
- Supports proper deployment workflows.
- Enables CI/CD integration.
- Good for production-like testing.

SPARK NOTEBOOKS APPROACH

Characteristics:
- Interactive development environment.
- Persistent Spark session until explicit termination.
- Requires explicit unpersist() calls (see the sketch after this comparison).

Technical considerations:
- Faster for development and prototyping.
- Cache behavior differs from production.
- Risk of memory leaks without proper cache management.
- Better for exploratory data analysis.
- Good for collaborative development.
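
As a rough sketch of the two workflows, with a hypothetical job class, dataset path, and cluster manager (none of these come from the bootcamp material):

# Server approach: submit a packaged job from the CLI; every run starts a fresh session,
# so cached data is dropped automatically when the job ends
spark-submit --class com.example.WeekThreeJob --master yarn week_three_job.jar

# Notebook approach: the session persists between cells, so caches must be released by hand
df = spark.read.parquet("/data/events")   # hypothetical dataset
df.cache()
df.count()        # an action materializes the cache
# ... exploratory queries ...
df.unpersist()    # without this, the cached data stays around for the life of the session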

CACHING & TEMPORARY VIEWS

CACHE
Temporary storage of computed data in memory and/or disk. Allows the same data to be reused without recomputing it.

Storage Levels

MEMORY ONLY
- Primary choice for most scenarios.
- Fastest access time.
- Limited by available RAM.

DISK ONLY
- Used when memory is insufficient.
- Slower than memory caching.
- Similar performance to a materialized view.
- Consider staging tables instead.

MEMORY & DISK
- Hybrid approach.
- Spills to disk when memory is full.
- Not always optimal for performance.
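A minimal PySpark sketch of choosing a storage level explicitly (the dataset path is illustrative):

from pyspark.sql import SparkSession
from pyspark.storagelevel import StorageLevel

spark = SparkSession.builder.appName("caching-demo").getOrCreate()
df = spark.read.parquet("/data/events")  # hypothetical dataset

# Pick one storage level per DataFrame:
df.persist(StorageLevel.MEMORY_ONLY)       # fastest access, limited by available RAM
# df.persist(StorageLevel.DISK_ONLY)       # slower; the notes suggest a staging table instead
# df.persist(StorageLevel.MEMORY_AND_DISK) # hybrid; spills to disk when memory fills up
# df.cache()                               # DataFrame shorthand for a memory-and-disk level

df.count()       # caching is lazy: an action is needed to materialize it
df.unpersist()   # release the cache once the reuse is over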

Rule of thumb for storage
1. Use memory caching whenever possible.
2. Use staging tables instead of disk caching.
3. For edge cases with large, frequently reused results: prefer staging tables over disk caching, and break jobs down into manageable tasks (faster processing).

TEMPORARY VIEW
- A named reference to a DataFrame in Spark (a 'virtual table').
- No physical storage (the query is recomputed each time).
- Exists only within a specific Spark session (lost when the session ends).
- Created using createTempView() or createOrReplaceTempView() (see the sketch below).
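
A short sketch of registering and querying a temporary view (the view, path, and column names are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("temp-view-demo").getOrCreate()
df = spark.read.parquet("/data/events")  # hypothetical dataset

# Register the DataFrame as a session-scoped 'virtual table'
df.createOrReplaceTempView("events")

# Each query against the view re-runs the underlying plan unless the data is also cached
daily = spark.sql("SELECT event_date, COUNT(*) AS n FROM events GROUP BY event_date")
daily.show()

# The view is gone once this Spark session ends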
CACHING VS. BROADCAST JOIN

CACHING
- Purpose: stores pre-computed values for reuse across multiple operations.
- Data distribution: maintains the data's partitioning across the cluster.
- Memory management: data is distributed across all nodes (memory efficient); handles large datasets since the data stays partitioned.
- Use cases: large datasets that need multiple transformations; partitioned data that needs to remain partitioned; iterative tasks & repeated queries on the same dataset.

BROADCAST JOIN
- Purpose: optimizes join operations for scenarios with one small and one large dataset.
- Data distribution: copies the entire small dataset to all executors in the cluster.
- Memory management: highly efficient for small-large table joins (no shuffle); the memory threshold defaults to 10 MB (it can be changed manually, or the dataset can be wrapped in broadcast()).
- Use cases: joining a large dataset with a small lookup table; the smaller dataset fits into executor memory (<2 GB); network bandwidth is not a bottleneck for broadcasting; performing joins without shuffling.

Broadcast JOIN flow in Spark

Both datasets are loaded into the Spark cluster. The driver checks whether the small dataset is within the broadcast threshold. If it is, the driver creates copies of the small dataset for every worker node and partitions the large dataset into smaller subsets; each worker (1, 2, 3, ..., N) joins its partition of the large dataset against its local copy of the small dataset, and the per-partition results are combined into the final dataset. If the small dataset exceeds the threshold, a regular shuffle-based join is performed instead.

large_dataset.join(broadcast(small_dataset), 'ID')
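
A minimal PySpark sketch of the flow above (the paths and the threshold value are illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join-demo").getOrCreate()

# Optionally raise the automatic broadcast threshold from its 10 MB default (value is illustrative)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)

large_df = spark.read.parquet("/warehouse/events")         # hypothetical large fact table
small_df = spark.read.parquet("/warehouse/country_codes")  # hypothetical small lookup table

# Explicit broadcast hint: every executor receives a full copy of small_df,
# so the large table is joined in place without a shuffle
joined = large_df.join(broadcast(small_df), "ID")
joined.explain()  # the plan should show a BroadcastHashJoin rather than a SortMergeJoin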


UDFs / UDAFs
UDF (User Defined Function): a way to implement custom column transformations and complex data-processing logic within Spark DataFrame operations.
UDAF (User Defined Aggregation Function): more specialized than regular UDFs; designed specifically for aggregation operations.
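
A minimal PySpark UDF sketch matching the first definition (the column and the logic are only an illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-demo").getOrCreate()
df = spark.createDataFrame([("alice",), ("bob",)], ["name"])  # toy data

# A regular Python UDF: each value is shipped to a Python worker, transformed, and shipped back
@udf(returnType=StringType())
def shout(s):
    return None if s is None else s.upper() + "!"

df.withColumn("loud_name", shout(col("name"))).show()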

PySpark vs. ScalaSpark

PYSPARK
Complex UDF workflow: Scala data -> JVM serialization -> Python process -> execute UDF -> serialize result -> return to Scala.
Considerations:
- Python-user friendly.
- Broader community/ecosystem.
- Improved by Apache Arrow, but performance issues remain.
- Performance overhead due to serialization.
Recent improvements:
- Apache Arrow integration improved PySpark UDF performance.
- Better alignment between PySpark & Scala Spark UDFs through optimization.

SCALASPARK
Pros:
- Direct JVM execution.
- More performant UDAFs.
- Native Dataset API access.
- Better performance (no serialization overhead).
Cons:
- Not free (commercial licensing).
- Steep learning curve.
Dataset API benefits:
- Enables pure Scala functional programming.
- Eliminates the need for traditional UDFs.
- Provides type-safe operations.
- Better performance characteristics.

Zach's recommendations
Primary recommendation:
- PySpark over Scala (more job opportunities).
- Scala if coming from a Java background.
Language transition considerations:
- Java to Scala: easier transition (both JVM languages).
- Python to Scala: tough transition (interpreted vs. compiled languages).
Performance optimization tips:
- Use the Dataset API in Scala when possible.
- Leverage Apache Arrow optimizations in PySpark (see the sketch below).
- Consider UDAFs for aggregation operations in Scala.
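
A small sketch of an Arrow-backed (vectorized) UDF in PySpark, which avoids most of the per-row serialization overhead described above; the column names and the tax rate are made up:

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, pandas_udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("arrow-udf-demo").getOrCreate()
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")  # Arrow-based data transfer

df = spark.createDataFrame([(1, 10.0), (2, 20.0)], ["id", "price"])  # toy data

# A pandas UDF receives whole Arrow batches as pandas Series instead of one row at a time
@pandas_udf(DoubleType())
def add_tax(price: pd.Series) -> pd.Series:
    return price * 1.21  # illustrative rate

df.withColumn("price_with_tax", add_tax(col("price"))).show()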


The API continuum

SQL
- Ideal for multi-user collaboration (data analysts/scientists).
- Fast prototyping/experimentation.
- Most flexible for rapid changes.
- Lowest entry barrier (SQL only).
- Simply returns null on null encounters (less strict nullability controls).

DataFrame
- Ideal for hardened PySpark pipelines with minimal change requirements.
- Enables code modularization.
- Better testability with function separation.
- Balanced between flexibility and structure.
- Simply returns null on null encounters (less strict nullability controls).

Dataset
- Ideal for strong software engineering practices & enterprise solutions.
- Enhanced schema handling.
- Superior unit & integration testing capabilities.
- Easy mock data generation.
- Built-in type safety.
- Explicit null handling: required nullability declarations (the pipeline fails if nullability rules are violated).
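
A quick PySpark illustration of the same aggregation expressed at two points on the continuum, once as SQL against a temporary view and once with the DataFrame API (the table and column names are made up):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("api-continuum-demo").getOrCreate()
events = spark.read.parquet("/warehouse/events")  # hypothetical dataset

# SQL style: lowest entry barrier, convenient for analysts sharing one view
events.createOrReplaceTempView("events")
by_country_sql = spark.sql("SELECT country, COUNT(*) AS event_count FROM events GROUP BY country")

# DataFrame style: the same logic as a plain function, which is easier to modularize and unit test
def events_by_country(df):
    return df.groupBy("country").agg(F.count("*").alias("event_count"))

by_country_df = events_by_country(events)

The Dataset end of the continuum applies to Scala, where the equivalent function would additionally carry compile-time types.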

Spark Tuning

Parquet: columnar file format with run-length encoding; benefits from parallel processing and more efficient querying.
Run-length encoding benefits:
- Leverages data recency & primary patterns.
- Reduces cloud storage costs & network traffic, gives better compression ratios, and eliminates cross-partition data movement.

Sorting recommendations:
- DO: .sortWithinPartitions() (parallelizable operation, maintains good data distribution, sorts locally, cost effective, no expensive shuffle operations). See the sketch after this section.
- DON'T: a global .sort() (very slow performance; resource intensive; creates an additional shuffle step; all data forced onto one single executor; requires strict order maintenance; computationally expensive).

Technical implementation:
1. Partition-level processing.
2. Performance optimization.
3. Storage efficiency optimization.

Memory Strategy:
1. Executor: don't arbitrarily set to 16 GB (configure based on workload requirements).
2. Driver: bump only when needed.

Shuffle Partitions:
1. Default: 200 partitions (~100 MB/partition).
2. Optimized: test with multiple partition counts (1000, 2000, 3000, ...), measure performance metrics, and adjust parameters incrementally. Check the job type:
   - I/O-heavy job? Focus on partition optimization.
   - Memory-heavy job? Balance executor memory.
   - Network-heavy job? Prioritize network topology & data locality.

AQE (Adaptive Query Execution):
- Optimizes for skewed datasets.
- Automatically adjusts query plans based on runtime stats.
- Don't enable it preemptively (let Spark handle skew optimization).

Albert Campillo Repost
