PySpark Cheat Sheet for RDD Operations

Creating RDDs

1. Parallelizing a Collection:

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder \
    .appName("Parallelize Collection") \
    .getOrCreate()

# Sample collection
data = [1, 2, 3, 4, 5]

# Create RDD by parallelizing the collection
rdd = spark.sparkContext.parallelize(data)

# Show the RDD elements
print("Parallelized RDD:")
print(rdd.collect())

2. Loading Data from External Sources:

PySpark supports loading data from various external sources, including text files, CSV files, JSON files, and more. This method is suitable for larger datasets that are stored externally.

# Load text file as an RDD
text_file_path = "path/to/text/file.txt"
text_rdd = spark.sparkContext.textFile(text_file_path)

# Load CSV file as an RDD
csv_file_path = "path/to/csv/file.csv"
csv_rdd = spark.sparkContext.textFile(csv_file_path)

# Load JSON file as an RDD
json_file_path = "path/to/json/file.json"
json_rdd = spark.sparkContext.textFile(json_file_path)

# Show the first few elements of each RDD
print("Text File RDD:")
print(text_rdd.take(5))
print("CSV File RDD:")
print(csv_rdd.take(5))
print("JSON File RDD:")
print(json_rdd.take(5))
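
Note that textFile returns each line as a plain string, so CSV and JSON content still has to be parsed. A minimal sketch, assuming comma-separated columns and line-delimited JSON (both are assumptions, not properties of the files above):

import json

# Split each CSV line into a list of fields (assumes a simple comma-separated layout)
parsed_csv_rdd = csv_rdd.map(lambda line: line.split(","))

# Parse each line of a line-delimited JSON file into a dict
parsed_json_rdd = json_rdd.map(json.loads)

print("Parsed CSV RDD:")
print(parsed_csv_rdd.take(5))
print("Parsed JSON RDD:")
print(parsed_json_rdd.take(5))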

Transformations

1. map Transformation

# Sample RDD
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])

# Use map to double each element
doubled_rdd = rdd.map(lambda x: x * 2)

# Show the transformed RDD
print("Doubled RDD:")
print(doubled_rdd.collect())

2. filter Transformation

# Use filter to keep even numbers
even_rdd = rdd.filter(lambda x: x % 2 == 0)

# Show the filtered RDD
print("Even Numbers RDD:")
print(even_rdd.collect())

3. flatMap Transformation

# Use flatMap to split sentences into words
sentences_rdd = spark.sparkContext.parallelize(["Hello World", "PySpark Transformation"])
words_rdd = sentences_rdd.flatMap(lambda sentence: sentence.split())

# Show the transformed RDD
print("Words RDD:")
print(words_rdd.collect())

4. distinct Transformation

# Sample RDD with duplicates
duplicate_rdd = spark.sparkContext.parallelize([1, 2, 2, 3, 3, 3, 4, 5])

# Use distinct to get unique elements
distinct_rdd = duplicate_rdd.distinct()

# Show the distinct RDD
print("Distinct RDD:")
print(distinct_rdd.collect())
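
Transformations are lazy: they only record lineage, and nothing runs until an action is called. They can also be chained freely; a minimal sketch reusing the rdd defined above:

# Nothing is computed yet; this only builds the lineage
pipeline_rdd = rdd.map(lambda x: x * 2).filter(lambda x: x > 4)

# The action triggers the actual computation
print("Chained Result:")
print(pipeline_rdd.collect())  # [6, 8, 10]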

Actions

1. collect Action

# Sample RDD
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])

# Use collect to retrieve all elements
all_elements = rdd.collect()

# Print the collected elements
print("All Elements:")
print(all_elements)

2. count Action

# Use count to get the number of elements
element_count = rdd.count()

# Print the element count
print("Element Count:", element_count)

3. take Action

# Use take to get the first 3 elements
first_three_elements = rdd.take(3)

# Print the first three elements
print("First Three Elements:")
print(first_three_elements)

4. reduce Action

# Use reduce to calculate the sum of elements
sum_result = rdd.reduce(lambda x, y: x + y)

# Print the sum result
print("Sum of Elements:", sum_result)

5. foreach Action

# Use foreach to print each element
def print_element(element):
    print("Element:", element)

rdd.foreach(print_element)

6. saveAsTextFile Action

# Save the RDD elements to a text file
rdd.saveAsTextFile("path/to/output")
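
Two other commonly used actions, not shown in the examples above, are countByValue and takeOrdered; a minimal sketch:

# Count occurrences of each value (returns a dict-like object on the driver)
value_counts = rdd.countByValue()
print("Value Counts:", dict(value_counts))

# Return the 3 smallest elements in sorted order
smallest_three = rdd.takeOrdered(3)
print("Smallest Three:", smallest_three)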
print("Element Count:", element_count) print(key, list(values)) .appName("Error Handling Example") \

Transformations 3. take Action 3. join Transformation


.getOrCreate()

try:
# Sample RDD
1. map Transformation # Use take to get the first 3 elements # Another Pair RDD rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
first_three_elements = rdd.take(3) another_pair_rdd = spark.sparkContext.parallelize([(1, 'A'), (2, 'B')])
# Transformation with potential error
# Sample RDD # Print the first three elements # Use join to combine Pair RDDs using keys transformed_rdd = rdd.map(lambda x: 10 / x)
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5]) print("First Three Elements:") joined_rdd = pair_rdd.join(another_pair_rdd)
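
Two more pair RDD operations that often appear alongside the ones above are mapValues and countByKey; a minimal sketch reusing pair_rdd:

# Transform only the values, keeping the keys (and partitioning) unchanged
scaled_rdd = pair_rdd.mapValues(lambda v: v * 10)
print("Scaled Pair RDD:", scaled_rdd.collect())

# Count records per key (action, returns a dict-like object on the driver)
counts_per_key = pair_rdd.countByKey()
print("Counts per Key:", dict(counts_per_key))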

Persistence

# Sample RDD
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])

# Cache RDD in memory
rdd.cache()

# Alternatively, persist RDD in memory and disk
# rdd.persist(storageLevel=pyspark.StorageLevel.MEMORY_AND_DISK)

# Perform transformations and actions on cached RDD
squared_rdd = rdd.map(lambda x: x * x)
sum_result = squared_rdd.reduce(lambda x, y: x + y)

# Print the result
print("Sum of Squares:", sum_result)

# Unpersist (remove from cache) if no longer needed
rdd.unpersist()
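
persist also accepts an explicit storage level; a minimal sketch (the sample data here is illustrative):

from pyspark import StorageLevel

# Persist with an explicit storage level: keep in memory, spill to disk if needed
persisted_rdd = spark.sparkContext.parallelize([10, 20, 30])
persisted_rdd.persist(StorageLevel.MEMORY_AND_DISK)

# Inspect the storage level and cache status
print(persisted_rdd.getStorageLevel())
print("Is cached:", persisted_rdd.is_cached)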

Partitioning

Repartitioning and Coalescing RDDs

1. Repartitioning

# Repartition RDD into 5 partitions
repartitioned_rdd = rdd.repartition(5)

2. Coalescing

# Coalesce RDD to 2 partitions
coalesced_rdd = repartitioned_rdd.coalesce(2)
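
The effect of both operations can be verified with getNumPartitions; a minimal sketch:

# Check partition counts before and after
print("Original partitions:", rdd.getNumPartitions())
print("After repartition(5):", repartitioned_rdd.getNumPartitions())
print("After coalesce(2):", coalesced_rdd.getNumPartitions())

repartition performs a full shuffle, while coalesce reduces the partition count without one, which is why it is usually preferred when shrinking the number of partitions.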

Broadcasting

Using the broadcast Function

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

# Create a SparkSession
spark = SparkSession.builder \
    .appName("Broadcast Example") \
    .getOrCreate()

# Sample DataFrame and RDD
small_df = spark.createDataFrame([(1, 'A'), (2, 'B')], ['key', 'value'])
large_rdd = spark.sparkContext.parallelize([(1, 'Alice'), (2, 'Bob'), (3, 'Charlie')])

# Broadcast small DataFrame for efficient join
joined_df = large_rdd.toDF(['key', 'name']).join(broadcast(small_df), on="key")

# Show the joined DataFrame
joined_df.show()
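
For pure RDD code (without DataFrames), the same idea is available through broadcast variables on the SparkContext; a minimal sketch (the lookup dict is illustrative):

# Ship a small lookup table to every executor once
lookup = spark.sparkContext.broadcast({1: 'A', 2: 'B'})

# Use the broadcast value inside a transformation
labeled_rdd = large_rdd.map(lambda kv: (kv[0], kv[1], lookup.value.get(kv[0], 'unknown')))
print(labeled_rdd.collect())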

Error Handling & Fault Tolerance

PySpark provides mechanisms to handle errors and recover from faults, ensuring reliable data processing. Example:

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder \
    .appName("Error Handling Example") \
    .getOrCreate()

try:
    # Sample RDD
    rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])

    # Transformation with potential error
    transformed_rdd = rdd.map(lambda x: 10 / x)

    # Action triggering computation
    result = transformed_rdd.collect()
except Exception as e:
    # Worker-side errors (e.g. a ZeroDivisionError inside the map) surface here,
    # wrapped by Spark, when the action runs
    print("Error:", e)
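
Fault tolerance for RDDs comes primarily from lineage: lost partitions are recomputed from their source. For long lineages, checkpointing truncates the chain by writing the RDD to storage; a minimal sketch (the checkpoint directory is illustrative and should be a reliable store such as HDFS in production):

# Configure a checkpoint directory
spark.sparkContext.setCheckpointDir("path/to/checkpoint/dir")

# Build an RDD and mark it for checkpointing
checkpointed_rdd = spark.sparkContext.parallelize(range(100)).map(lambda x: x * 2)
checkpointed_rdd.checkpoint()

# The checkpoint is written when the next action runs
print("Count:", checkpointed_rdd.count())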
