DE Bootcamp _ Week 3 Day 2
DEVELOPMENT ENVIRONMENTS
Data distribution
- Partitioned: maintains data partitions across the cluster.
- Broadcast: copies the entire small dataset to all executors in the cluster.
Memory management
- Partitioned: data is distributed across all nodes (memory efficient); handles large datasets since data is partitioned.
- Broadcast: highly efficient for small-large table joins (no shuffle); broadcast threshold defaults to 10 MB (can be changed manually, or the dataset can be wrapped in broadcast()).
Use cases
- Partitioned: large datasets that need multiple transformations; partitioned data that needs to remain partitioned; iterative tasks & repeated queries on the same dataset.
- Broadcast: joining a large dataset with a small lookup table; the smaller dataset fits into executor memory (<2 GB); network bandwidth is not a bottleneck for broadcasting; performs joins without shuffling.
[Diagram: broadcast join. The small dataset is copied to every partition (1 ... n), joined locally against each one, and the results form the combined dataset.]
UDFs / UDAFs
UDF (User Defined Function): a way to implement custom column transformations & complex data processing logic
within Spark DataFrame operations
UDAF (User Defined Aggregation Function): more specialized than regular UDFs, designed specifically for aggregation
operations.
Zach’s recommendations
Recent improvements:
- Apache Arrow integration improved PySpark UDF performance.
- Better alignment between PySpark & Scala Spark UDFs through optimization.
Dataset API benefits:
- Enables pure Scala functional programming.
- Eliminates the need for traditional UDFs.
- Provides type-safe operations.
- Better performance characteristics.
Primary recommendation:
- PySpark over Scala (more job opportunities).
- Scala if coming from a Java background.
Language transition considerations:
- Java to Scala: easier transition (both JVM languages).
- Python to Scala: tough transition (interpreted vs. compiled languages).
Performance optimization tips:
- Use the Dataset API in Scala when possible.
- Leverage Apache Arrow optimizations in PySpark.
- Consider UDAFs for aggregation operations in Scala.
Spark Tuning
Parquet: columnar file format with run-length encoding; benefits from parallel processing and more efficient querying.
Run-length encoding benefits:
- Leverages data recency & primary patterns.
- Reduces cloud storage costs & network traffic, better compression ratios, eliminates cross-partition data movement.
Sorting recommendations:
- DO: .sortWithinPartitions() (parallelizable operation, maintains good data distribution, sorts locally, cost-effective, no expensive shuffle operations).
- DON'T: a global .sort() (very slow performance; resource intensive; creates an additional shuffle step; all data forced to one single executor; requires strict order maintenance; computationally expensive).
Technical implementation:
1. Partition-level processing
2. Performance optimization
3. Storage efficiency
Memory Strategy:
1. Executor: don't arbitrarily set it to 16 GB (configure based on workload requirements).
2. Driver: bump it only when needed.
Shuffle Partitions:
1. Default: 200 partitions (~100 MB/partition).
2. Optimized: test with multiple partition counts (1000, 2000, 3000, ...), measure performance metrics, and incrementally adjust parameters. Check the job type:
   - I/O-heavy job? Focus on partition optimization.
   - Memory-heavy job? Balance executor memory.
   - Network-heavy job? Prioritize network topology & data locality.
AQE (Adaptive Query Execution):
- Optimizes for skewed datasets.
- Automatically adjusts query plans based on runtime stats.
- Don't enable it preemptively (let Spark handle skew optimization).