PySpark Cheat Sheet for RDD Operations

Creating RDDs

1. Parallelizing a Collection:

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder \
    .appName("Parallelize Collection") \
    .getOrCreate()

# Sample collection
data = [1, 2, 3, 4, 5]

# Create RDD by parallelizing the collection
rdd = spark.sparkContext.parallelize(data)

# Show the RDD elements
print("Parallelized RDD:")
print(rdd.collect())

2. Loading Data from External Sources:

PySpark supports loading data from various external sources, including text files, CSV files, JSON files, and more. This method is suitable for larger datasets that are stored externally.

# Load text file as an RDD
text_file_path = "path/to/text/file.txt"
text_rdd = spark.sparkContext.textFile(text_file_path)

# Load CSV file as an RDD
csv_file_path = "path/to/csv/file.csv"
csv_rdd = spark.sparkContext.textFile(csv_file_path)

# Load JSON file as an RDD
json_file_path = "path/to/json/file.json"
json_rdd = spark.sparkContext.textFile(json_file_path)

# Show the first few elements of each RDD
print("Text File RDD:")
print(text_rdd.take(5))
print("CSV File RDD:")
print(csv_rdd.take(5))
print("JSON File RDD:")
print(json_rdd.take(5))
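
Note that textFile returns each line as a plain string, so CSV and JSON content still has to be parsed. A minimal sketch, assuming comma-separated columns and line-delimited JSON (both are assumptions, not properties of the files above):

import json

# Split each CSV line into a list of fields (assumes a simple comma-separated layout)
parsed_csv_rdd = csv_rdd.map(lambda line: line.split(","))

# Parse each line of a line-delimited JSON file into a dict
parsed_json_rdd = json_rdd.map(json.loads)

print("Parsed CSV RDD:")
print(parsed_csv_rdd.take(5))
print("Parsed JSON RDD:")
print(parsed_json_rdd.take(5))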

Transformations

1. map Transformation

# Sample RDD
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])

# Use map to double each element
doubled_rdd = rdd.map(lambda x: x * 2)

# Show the transformed RDD
print("Doubled RDD:")
print(doubled_rdd.collect())

2. filter Transformation

# Use filter to keep even numbers
even_rdd = rdd.filter(lambda x: x % 2 == 0)

# Show the filtered RDD
print("Even Numbers RDD:")
print(even_rdd.collect())

3. flatMap Transformation

# Use flatMap to split sentences into words
sentences_rdd = spark.sparkContext.parallelize(["Hello World", "PySpark Transformation"])
words_rdd = sentences_rdd.flatMap(lambda sentence: sentence.split())

# Show the transformed RDD
print("Words RDD:")
print(words_rdd.collect())

4. distinct Transformation

# Sample RDD with duplicates
duplicate_rdd = spark.sparkContext.parallelize([1, 2, 2, 3, 3, 3, 4, 5])

# Use distinct to get unique elements
distinct_rdd = duplicate_rdd.distinct()

# Show the distinct RDD
print("Distinct RDD:")
print(distinct_rdd.collect())
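
Transformations are lazy: they only record lineage, and nothing runs until an action is called. They can also be chained freely; a minimal sketch reusing the rdd defined above:

# Nothing is computed yet; this only builds the lineage
pipeline_rdd = rdd.map(lambda x: x * 2).filter(lambda x: x > 4)

# The action triggers the actual computation
print("Chained Result:")
print(pipeline_rdd.collect())  # [6, 8, 10]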

Actions

1. collect Action

# Sample RDD
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])

# Use collect to retrieve all elements
all_elements = rdd.collect()

# Print the collected elements
print("All Elements:")
print(all_elements)

2. count Action

# Use count to get the number of elements
element_count = rdd.count()

# Print the element count
print("Element Count:", element_count)

3. take Action

# Use take to get the first 3 elements
first_three_elements = rdd.take(3)

# Print the first three elements
print("First Three Elements:")
print(first_three_elements)

4. reduce Action

# Use reduce to calculate the sum of elements
sum_result = rdd.reduce(lambda x, y: x + y)

# Print the sum result
print("Sum of Elements:", sum_result)

5. foreach Action

# Use foreach to print each element
def print_element(element):
    print("Element:", element)

rdd.foreach(print_element)

6. saveAsTextFile Action

# Save the RDD elements to a text file
rdd.saveAsTextFile("path/to/output")
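
Two other commonly used actions, not shown in the examples above, are countByValue and takeOrdered; a minimal sketch:

# Count occurrences of each value (returns a dict-like object on the driver)
value_counts = rdd.countByValue()
print("Value Counts:", dict(value_counts))

# Return the 3 smallest elements in sorted order
smallest_three = rdd.takeOrdered(3)
print("Smallest Three:", smallest_three)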
print("Element Count:", element_count) print(key, list(values)) .appName("Error Handling Example") \

Transformations 3. take Action 3. join Transformation


.getOrCreate()

try:
# Sample RDD
1. map Transformation # Use take to get the first 3 elements # Another Pair RDD rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
first_three_elements = rdd.take(3) another_pair_rdd = spark.sparkContext.parallelize([(1, 'A'), (2, 'B')])
# Transformation with potential error
# Sample RDD # Print the first three elements # Use join to combine Pair RDDs using keys transformed_rdd = rdd.map(lambda x: 10 / x)
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5]) print("First Three Elements:") joined_rdd = pair_rdd.join(another_pair_rdd)
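
Two more pair RDD operations that often appear alongside the ones above are mapValues and countByKey; a minimal sketch reusing pair_rdd:

# Transform only the values, keeping the keys (and partitioning) unchanged
scaled_rdd = pair_rdd.mapValues(lambda v: v * 10)
print("Scaled Pair RDD:", scaled_rdd.collect())

# Count records per key (action, returns a dict-like object on the driver)
counts_per_key = pair_rdd.countByKey()
print("Counts per Key:", dict(counts_per_key))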

Persistence

# Sample RDD
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])

# Cache RDD in memory
rdd.cache()

# Alternatively, persist RDD in memory and disk
# rdd.persist(storageLevel=pyspark.StorageLevel.MEMORY_AND_DISK)

# Perform transformations and actions on cached RDD
squared_rdd = rdd.map(lambda x: x * x)
sum_result = squared_rdd.reduce(lambda x, y: x + y)

# Print the result
print("Sum of Squares:", sum_result)

# Unpersist (remove from cache) if no longer needed
rdd.unpersist()
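
persist also accepts an explicit storage level; a minimal sketch (the sample data here is illustrative):

from pyspark import StorageLevel

# Persist with an explicit storage level: keep in memory, spill to disk if needed
persisted_rdd = spark.sparkContext.parallelize([10, 20, 30])
persisted_rdd.persist(StorageLevel.MEMORY_AND_DISK)

# Inspect the storage level and cache status
print(persisted_rdd.getStorageLevel())
print("Is cached:", persisted_rdd.is_cached)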

Partitioning

Repartitioning and Coalescing RDDs

1. Repartitioning

# Repartition RDD into 5 partitions
repartitioned_rdd = rdd.repartition(5)

2. Coalescing

# Coalesce RDD to 2 partitions
coalesced_rdd = repartitioned_rdd.coalesce(2)
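
The effect of both operations can be verified with getNumPartitions; a minimal sketch:

# Check partition counts before and after
print("Original partitions:", rdd.getNumPartitions())
print("After repartition(5):", repartitioned_rdd.getNumPartitions())
print("After coalesce(2):", coalesced_rdd.getNumPartitions())

repartition performs a full shuffle, while coalesce reduces the partition count without one, which is why it is usually preferred when shrinking the number of partitions.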

Broadcasting

Using the broadcast Function

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

# Create a SparkSession
spark = SparkSession.builder \
    .appName("Broadcast Example") \
    .getOrCreate()

# Sample DataFrame and RDD
small_df = spark.createDataFrame([(1, 'A'), (2, 'B')], ['key', 'value'])
large_rdd = spark.sparkContext.parallelize([(1, 'Alice'), (2, 'Bob'), (3, 'Charlie')])

# Broadcast small DataFrame for efficient join
joined_df = large_rdd.toDF(['key', 'name']).join(broadcast(small_df), on="key")

# Show the joined DataFrame
joined_df.show()
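
For pure RDD code (without DataFrames), the same idea is available through broadcast variables on the SparkContext; a minimal sketch (the lookup dict is illustrative):

# Ship a small lookup table to every executor once
lookup = spark.sparkContext.broadcast({1: 'A', 2: 'B'})

# Use the broadcast value inside a transformation
labeled_rdd = large_rdd.map(lambda kv: (kv[0], kv[1], lookup.value.get(kv[0], 'unknown')))
print(labeled_rdd.collect())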

Error Handling & Fault Tolerance

PySpark provides mechanisms to handle errors and recover from faults, ensuring reliable data processing. Example:

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder \
    .appName("Error Handling Example") \
    .getOrCreate()

try:
    # Sample RDD
    rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])

    # Transformation with potential error
    transformed_rdd = rdd.map(lambda x: 10 / x)

    # Action triggering computation
    result = transformed_rdd.collect()
except Exception as e:
    # Worker-side errors (e.g. a ZeroDivisionError inside the map) surface here,
    # wrapped by Spark, when the action runs
    print("Error:", e)
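
Fault tolerance for RDDs comes primarily from lineage: lost partitions are recomputed from their source. For long lineages, checkpointing truncates the chain by writing the RDD to storage; a minimal sketch (the checkpoint directory is illustrative and should be a reliable store such as HDFS in production):

# Configure a checkpoint directory
spark.sparkContext.setCheckpointDir("path/to/checkpoint/dir")

# Build an RDD and mark it for checkpointing
checkpointed_rdd = spark.sparkContext.parallelize(range(100)).map(lambda x: x * 2)
checkpointed_rdd.checkpoint()

# The checkpoint is written when the next action runs
print("Count:", checkpointed_rdd.count())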
