# Spark Optimization Case Study
## 1. Broadcast Join Example:
**Scenario**: You are joining a 1 TB customer transactions dataset with a small 100 MB customer
demographics dataset.
**Solution**: Use a **broadcast join** to avoid shuffling the large dataset. The smaller dataset
is copied to every executor, so the large dataset can stay where it is.
```python
from pyspark.sql.functions import broadcast
large_df = spark.read.parquet("s3://large-transactions")
small_df = spark.read.parquet("s3://small-customer-demographics")
joined = large_df.join(broadcast(small_df), "customer_id")  # join key assumed for illustration
```
This ensures the 1 TB dataset is never shuffled; only the 100 MB demographics dataset is copied to each executor, saving shuffle time.
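Spark can also broadcast small tables on its own when they fall below `spark.sql.autoBroadcastJoinThreshold`; a minimal sketch of raising that threshold (the 200 MB value is illustrative, not from the scenario):
```python
# Default threshold is 10 MB; raising it lets Spark auto-broadcast tables up to ~200 MB
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 200 * 1024 * 1024)
```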
## 2. reduceByKey Example:
**Scenario**: You need to transform and aggregate a sales dataset. Instead of using
`groupByKey`, which shuffles every value across the cluster, use `reduceByKey`, which combines
values within each partition before the shuffle.
```python
# reduceByKey pre-aggregates locally on each partition, so far less data is shuffled
sales_rdd = sc.parallelize(sales_data)  # e.g. [("store_a", 100.0), ("store_b", 55.0), ...]
result = sales_rdd.reduceByKey(lambda x, y: x + y)
```
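For contrast, a sketch of the `groupByKey` version it replaces (same hypothetical `sales_data` pairs):
```python
# groupByKey ships every (key, value) pair across the network before summing,
# whereas reduceByKey above shuffles only one partial sum per key per partition
grouped = sales_rdd.groupByKey().mapValues(lambda values: sum(values))
```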
## 3. Persisting to Disk Example:
**Scenario**: You're working on a Spark job that processes 10 TB of web logs. Instead of storing all
intermediate results in memory, persist them to disk so executors are not forced to hold the full
dataset in RAM.
```python
from pyspark import StorageLevel

df = spark.read.json("s3://large-logs/")
df.persist(StorageLevel.DISK_ONLY)
```
This ensures you don't run out of memory while processing large datasets.
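If some memory is available, `StorageLevel.MEMORY_AND_DISK` is a common middle ground in place of `DISK_ONLY`, and releasing the data once it is no longer reused frees executor resources; a small sketch:
```python
# Alternative to DISK_ONLY: keep what fits in memory, spill the rest to disk
df.persist(StorageLevel.MEMORY_AND_DISK)
# ... transformations/actions that reuse df ...
df.unpersist()  # release the cached blocks when the reuse is over
```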
## 4. Shuffle Partitions Example:
**Scenario**: By default, Spark creates 200 partitions after a shuffle. However, for large datasets (e.g.,
5 TB), 200 partitions may be too few, causing oversized partitions and high memory consumption per task.
```python
spark.conf.set("spark.sql.shuffle.partitions", "1000")
```
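On Spark 3.x, Adaptive Query Execution can also tune shuffle partition sizes at runtime rather than relying on one static number; a sketch of the relevant settings:
```python
# Let AQE coalesce small shuffle partitions at runtime (Spark 3.0+)
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
```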
## 5. Executor Memory Example:
**Scenario**: Your Spark executors run out of memory when processing a large dataset. Increase the
executor memory; note that `spark.executor.memory` is a launch-time setting, so it must be supplied
when the application starts (e.g., at session creation or via `spark-submit`).
```python
# Executor memory cannot be changed on a running session; set it when the session is built
spark = SparkSession.builder.config("spark.executor.memory", "4g").getOrCreate()
```
## 6. Salting Example:
**Scenario**: You're processing sales data partitioned by region, but one region (`'North America'`)
contains far more rows than the others, causing data skew.
Adding a `salt` column randomizes the key, spreading the skewed region's rows more evenly across
partitions, as sketched below.
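A minimal salting sketch, assuming a DataFrame `sales_df` with a `region` column (the name and the salt range of 10 are illustrative):
```python
from pyspark.sql import functions as F

# Add a random salt (0-9) and repartition on (region, salt) instead of region alone
salted = sales_df.withColumn("salt", (F.rand() * 10).cast("int"))
balanced = salted.repartition("region", "salt")
```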
## 7. Partition Pruning Example:
**Scenario**: Your dataset contains 100 GB of customer data partitioned by `year`. When querying
only recent data, Spark pushes the filter down and scans only the relevant partitions.
```python
df = spark.read.parquet("s3://customer-data/")
recent = df.filter(df.year >= 2023)  # only the year >= 2023 partitions are scanned
```
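To confirm the pruning, the physical plan can be inspected; the partition predicate should appear under `PartitionFilters` in the Parquet scan:
```python
recent.explain()  # look for the year predicate in PartitionFilters
```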
## 8. Bucketing Example:
**Scenario**: You're frequently joining two datasets on `customer_id`. Bucketing both datasets on this
column ensures rows with the same key land in the same bucket, letting the join avoid a full shuffle.
```python
# Bucketing datasets on customer_id
df.write.bucketBy(10, "customer_id").saveAsTable("bucketed_customers")
```
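For the join to skip the shuffle, the other dataset needs the same bucket count and key, and both sides are read back as tables; a sketch assuming a second DataFrame `orders_df`:
```python
# Bucket the second dataset the same way, then join the bucketed tables
orders_df.write.bucketBy(10, "customer_id").saveAsTable("bucketed_orders")

customers = spark.table("bucketed_customers")
orders = spark.table("bucketed_orders")
joined = customers.join(orders, "customer_id")  # shuffle can be avoided on both sides
```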
## 9. Partitioning Example:
**Scenario**: Partition the dataset by `year` to improve query performance on time-series data.
```python
# Partitioning by year
df.write.partitionBy("year").parquet("s3://data/transactions")
```
## 10. Repartitioning Example:
**Scenario**: The partition for the `North America` region is much larger than the others. You decide to
repartition on an additional column so the skewed region is spread across many partitions.
```python
# Hashing on (region, sales_amount) spreads the oversized region across partitions
df.repartition("region", "sales_amount").write.parquet("s3://balanced-partitions")
```
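When the skew surfaces in joins, Spark 3.x can also split oversized partitions automatically; a sketch of the relevant setting:
```python
# With AQE enabled (see the adaptive settings earlier), split skewed join partitions at runtime
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
```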
---
## 1. Indexing in Redshift:
**Scenario**: You're running frequent queries on a Redshift table filtering by `customer_id`. Redshift
has no conventional indexes; defining `customer_id` as the sort key plays that role, letting Redshift
skip blocks that don't match the filter. A sketch of the DDL (the table name is illustrative):
```sql
CREATE TABLE sales (
    sale_id BIGINT,
    customer_id INT,
    sale_amount DECIMAL(10,2),
    sale_date DATE
)
SORTKEY (customer_id);
```
## 2. Partitioning in Redshift:
**Scenario**: You're storing 10 years of sales data in Redshift and frequently query by date range.
Native Redshift tables are not partitioned; distributing and sorting on `sale_date` serves a similar
purpose, so range queries read only the relevant blocks.
```sql
CREATE TABLE sales (
    sale_id BIGINT,
    customer_id INT,
    sale_amount DECIMAL(10,2),
    sale_date DATE
)
DISTKEY (sale_date)
SORTKEY (sale_date);
```
Redshift distribution styles:
- **KEY Distribution**: Distributes rows across nodes based on the values of a specific column (like
`customer_id`).
- **ALL Distribution**: A full copy of the table is stored on every node (useful for small, frequently
joined tables; see the sketch below).
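A sketch of ALL distribution for a small dimension table (the table and columns are illustrative):
```sql
-- Small lookup table replicated to every node so joins never need to redistribute it
CREATE TABLE customer_demographics (
    customer_id INT,
    segment VARCHAR(50)
)
DISTSTYLE ALL;
```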
## 3. Indexing in Postgres:
**Scenario**: In Postgres, you frequently run queries filtering by `email`. Adding an index on the
`email` column lets the planner use an index scan instead of reading the whole table, as in the sketch
below.
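A minimal example, assuming the table is named `customers`:
```sql
-- B-tree index to speed up equality filters and lookups on email
CREATE INDEX idx_customers_email ON customers (email);
```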
## 4. Partitioning in Postgres:
**Scenario**: You have a large time-series table and want to improve query performance by
partitioning it by range on `sale_date`.
```sql
CREATE TABLE sales (
    sale_id BIGINT,
    sale_date DATE
) PARTITION BY RANGE (sale_date);
```
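Declarative partitioning then needs child partitions covering the ranges you query; a sketch with illustrative yearly bounds:
```sql
-- One child partition per year; a sale_date filter scans only the matching partitions
CREATE TABLE sales_2023 PARTITION OF sales
    FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');
```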
## 5. Handling Uneven Distribution:
In Redshift, skew is addressed by choosing a distribution key and style that match data access
patterns; in Postgres, by choosing a partition key that spreads rows evenly.
---
- **Partitioning**: Divides data based on specific keys (e.g., `date`, `region`) so queries scan only the
partitions they need.
- **Bucketing**: Hashes data into a fixed number of buckets based on a key to improve joins.
- **Indexing**: Improves query performance by creating quick lookup structures for frequently filtered
columns.
- **Skew Handling**: For uneven data distribution, use salting or repartitioning to balance load
across partitions and nodes.