PySpark Caching & Persisting - Complete Guide
Caching and persisting in PySpark are optimization techniques for storing intermediate results in memory or
on disk. Understanding the difference between caching and persisting is important for optimizing the
performance of applications that involve heavy data transformations and iterative computations.
1. Caching in PySpark #
Caching is a way to store a DataFrame (or RDD) in memory so that later operations can reuse it. When you
cache an RDD, Spark keeps it in executor memory only and recomputes any partitions that do not fit from the
original source when they are needed again; a cached DataFrame additionally spills partitions that do not fit
in memory to disk.
Characteristics of Caching: #
cache() is shorthand for persisting with the default storage level, so you cannot choose where the data is stored.
Caching is lazy: the data is materialized the first time an action runs on the cached DataFrame or RDD.
Partitions that do not fit in memory are recomputed from the source (RDDs) or spilled to disk (DataFrames) when needed.
Example:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("caching_example").getOrCreate()
# Sample data
data = [(1, "Alice"), (2, "Bob"), (3, "Charlie"), (4, "David")]
# Create a DataFrame
df = spark.createDataFrame(data, ["id", "name"])
# Perform a transformation and cache the result
df_transformed = df.withColumn("id_squared", df["id"] ** 2)
df_transformed.cache()
# Caching is lazy; the first action materializes the cached data
df_transformed.count()
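A quick way to confirm that the DataFrame is actually held by the executors is to check its storage level from the driver; a minimal sketch, assuming the df_transformed DataFrame created above:
# storageLevel reports how (and whether) the DataFrame is currently stored;
# an uncached DataFrame reports a level with every flag set to False.
print(df_transformed.storageLevel)
The Storage tab of the Spark web UI shows the same information for each cached dataset.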
2. Persisting in PySpark #
Persisting is more flexible than caching because it allows you to store data at various storage levels,
spanning memory, disk, or both. Whereas caching always uses the default storage level, persisting lets you
pass an explicit storage level to persist(); the most common levels are listed below.
Characteristics of Persisting: #
You can control the storage level more granularly compared to caching.
Suitable for larger datasets or situations where memory might be limited.
Can avoid recomputation by spilling data to disk when memory is insufficient (with the *_AND_DISK levels).
The storage levels come from pyspark.StorageLevel (from pyspark import StorageLevel):
1. MEMORY_ONLY:
Stores data in memory. If it does not fit, recomputes the remaining partitions.
Use Case: Suitable for small datasets that can fit into memory.
df_transformed.persist(StorageLevel.MEMORY_ONLY)
2. MEMORY_AND_DISK:
Stores data in memory but spills it to disk if there is insufficient memory.
Use Case: Ideal for datasets that may not entirely fit in memory.
df_transformed.persist(StorageLevel.MEMORY_AND_DISK)
3. DISK_ONLY:
Stores data on disk only. This storage level is slower but useful when memory is a constraint.
Use Case: Suitable for large datasets where memory is limited.
df_transformed.persist(StorageLevel.DISK_ONLY)
4. MEMORY_ONLY_SER:
Stores the data in memory in serialized form, reducing memory consumption at the cost of
additional CPU usage.
Use Case: Good for memory-limited scenarios where the serialization overhead is acceptable. Note that
PySpark always pickles data before storing it, so this level is mainly relevant in the Scala/Java API; in
Python, MEMORY_ONLY already behaves this way.
df_transformed.persist(StorageLevel.MEMORY_ONLY_SER)
5. MEMORY_AND_DISK_SER:
Similar to MEMORY_AND_DISK, but stores data in serialized format to save memory.
Use Case: Suitable for datasets that are large and may not fit in memory in their raw form. The same
PySpark caveat applies: in Python, MEMORY_AND_DISK plays this role.
df_transformed.persist(StorageLevel.MEMORY_AND_DISK_SER)
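Because the Python API pickles records before storing them, the serialized levels above are primarily a Scala/Java concept, and which constants are exposed varies by PySpark version. A minimal sketch for checking what your installation provides:
from pyspark import StorageLevel
# Print every storage-level constant defined by this PySpark version, so you can
# see which of the levels discussed above are available in Python.
for name in sorted(dir(StorageLevel)):
    if name.isupper():
        print(name, getattr(StorageLevel, name))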
Example of persisting a DataFrame:
from pyspark import StorageLevel
# Sample data
data = [(1, "Alice"), (2, "Bob"), (3, "Charlie"), (4, "David")]
# Create a DataFrame
df = spark.createDataFrame(data, ["id", "name"])
# Perform a transformation
df_transformed = df.withColumn("id_squared", df["id"] ** 2)
# Persist with an explicit storage level (spills to disk if memory runs short)
df_transformed.persist(StorageLevel.MEMORY_AND_DISK)
# persist() is lazy; the first action materializes the stored data
df_transformed.count()
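To confirm that later actions read from the persisted data instead of recomputing the transformation, you can inspect the physical plan; a minimal sketch, assuming the df_transformed DataFrame persisted above:
# Once the DataFrame is persisted and materialized, the plan should contain an
# InMemoryRelation / InMemoryTableScan node rather than a fresh scan of the source.
df_transformed.explain()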
Best Practices #
Monitor Memory Usage: Use Spark’s web UI (the Storage tab) to monitor how much memory your job is
using and adjust caching/persisting accordingly.
Unpersist Data: Always unpersist cached or persisted DataFrames once you’re done with them to free
up resources:
df_transformed.unpersist()
Use Serialization with Large Datasets: If you are working with large datasets, consider the serialized
levels MEMORY_ONLY_SER or MEMORY_AND_DISK_SER (Scala/Java) to reduce memory usage; in PySpark,
where data is always serialized, MEMORY_AND_DISK plays the same role.
Use Caching for Iterative Workloads: Caching is a good choice when you perform multiple actions
on the same DataFrame, as it avoids recomputing the transformation chain repeatedly (see the sketch
after this list).
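As an illustration of the iterative-workload tip, here is a minimal sketch (reusing the hypothetical df_transformed DataFrame from the earlier caching example) in which one cached DataFrame feeds several actions:
from pyspark.sql import functions as F
# Cache once, materialize with a single action, then reuse across several
# actions so the withColumn transformation is not recomputed each time.
features = df_transformed.cache()
features.count()
for threshold in [1, 4, 9]:
    matching = features.filter(F.col("id_squared") > threshold).count()
    print(f"id_squared > {threshold}: {matching} rows")
# Release executor memory once the iterative work is finished.
features.unpersist()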
Conclusion #
Caching and persisting are two important strategies for optimizing Spark applications, especially when
dealing with large datasets and repeated transformations. While caching is the simpler shorthand, persisting
gives you finer control over storage levels and degrades more gracefully when memory is limited. Choosing
between them depends on the size of your data, memory constraints, and the specific needs of your
application.