PySpark Cheatsheet
1. PySpark Overview
• Core Components: Spark Core (RDDs), Spark SQL (DataFrames), Structured Streaming, MLlib (machine learning), GraphX (graph processing).
3. PySpark Architecture
• Cluster Manager: allocates executors across the cluster; Spark supports Standalone, YARN, Mesos, and Kubernetes.
• Execution Process: the driver builds a DAG from the submitted transformations, the scheduler splits it into stages and tasks, and executors on worker nodes run those tasks.
• Transformations: lazy operations (e.g., map, filter, select) that only describe a new RDD/DataFrame.
• Actions: operations (e.g., collect, count, show) that trigger actual execution of the accumulated transformations; see the sketch below.
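A minimal sketch of the split, assuming an existing SparkSession named spark; nothing runs until the action on the last line:
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
doubled = rdd.map(lambda x: x * 2)            # transformation: lazy
evens = doubled.filter(lambda x: x % 4 == 0)  # transformation: still lazy
print(evens.collect())                        # action: triggers execution, prints [4, 8]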
5. PySpark SQL
• Creating a Temporary View (queryable like a table):
• df.createOrReplaceTempView("table_name")
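A short sketch putting both steps together, assuming a DataFrame df with an id column (hypothetical name):
df.createOrReplaceTempView("table_name")
spark.sql("SELECT * FROM table_name WHERE id > 10").show()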
6. Window Functions
• Types: ranking (row_number, rank, dense_rank), analytic (lag, lead), and aggregate (sum, avg, min, max) functions applied over a window.
• Example:
• df.withColumn("rank", rank().over(window_spec))
7. DataFrame API vs. SQL API
• DataFrame API:
o Pythonic, method-chaining syntax.
o Example: df.select("col1", "col2").filter(df.col3 > 10)
• SQL API:
o SQL-like syntax over registered temp views.
o Example: spark.sql("SELECT col1, col2 FROM table WHERE col3 > 10")
8. Caching and Persistence
• df.cache()
• from pyspark import StorageLevel
• df.persist(StorageLevel.DISK_ONLY)
9. Joins in PySpark
• Types of Joins: inner, left, right, full (outer), left semi, left anti, and cross (see the sketch after this section).
• Broadcast Join (the small DataFrame is shipped to every executor, avoiding a shuffle):
• from pyspark.sql.functions import broadcast
• df = large_df.join(broadcast(small_df), "key")
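A minimal sketch of the main join types, assuming DataFrames emp and dept that share a dept_id column (hypothetical names):
emp.join(dept, "dept_id", "inner").show()      # only rows with matching keys
emp.join(dept, "dept_id", "left").show()       # all rows from emp, nulls where no match
emp.join(dept, "dept_id", "left_anti").show()  # rows in emp with no match in dept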
• Reading Data:
• df = spark.read.format("csv").option("header", True).load("path")
• Writing Data:
• df.write.format("parquet").save("path")
• Repartition (full shuffle; can increase or decrease the number of partitions):
• df.repartition(10)
• Coalesce (no full shuffle; only decreases the number of partitions):
• df.coalesce(1)
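A short sketch checking partition counts before and after, assuming an existing DataFrame df:
print(df.rdd.getNumPartitions())    # current partition count
df10 = df.repartition(10)           # full shuffle into 10 partitions
df1 = df10.coalesce(1)              # merge down to 1 partition without a full shuffle
print(df10.rdd.getNumPartitions(), df1.rdd.getNumPartitions())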
• Reading Streams:
• df = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "localhost:9092").option("subscribe", "topic1").load()
• Writing Streams:
• query = df.writeStream.format("console").start()
• query.awaitTermination()
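A runnable end-to-end sketch that swaps Kafka for the built-in rate source (so no broker is needed) and writes to the console for about ten seconds:
stream_df = spark.readStream.format("rate").option("rowsPerSecond", 5).load()
query = stream_df.writeStream.format("console").outputMode("append").start()
query.awaitTermination(10)  # return after ~10 seconds instead of blocking forever
query.stop()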
• Common Exceptions: AnalysisException (missing columns, tables, or views), Py4JJavaError (an error raised on the JVM side), OutOfMemoryError (driver or executor memory exhausted).
• Debugging: inspect query plans with df.explain(), check stages, tasks, and shuffle sizes in the Spark UI, and read driver/executor logs; see the sketch below.
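A small sketch of those debugging aids, assuming an existing DataFrame df; explain() prints the physical plan, and selecting a missing column raises AnalysisException:
from pyspark.sql.utils import AnalysisException
df.explain()  # use df.explain(True) for the logical plans as well
try:
    df.select("no_such_column").show()
except AnalysisException as e:
    print("Analysis error:", e)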
1. PySpark Basics
Initialize SparkSession:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("AppName").getOrCreate()
Create DataFrame:
data = [(1, "Alice"), (2, "Bob")]
columns = ["id", "name"]
df = spark.createDataFrame(data, columns)
Inspect DataFrame:
df.show()
df.printSchema()
Read/Write Data:
# Read
df = spark.read.csv("input_path", header=True, inferSchema=True)
# Write
df.write.csv("output_path", header=True)
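A short follow-up sketch (hypothetical path) showing write modes and a Parquet round trip; mode("overwrite") replaces any existing output:
df.write.mode("overwrite").parquet("output_parquet_path")
df2 = spark.read.parquet("output_parquet_path")
df2.printSchema()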
2. PySpark SQL
SQL Queries:
df.createOrReplaceTempView("table")
spark.sql("SELECT * FROM table WHERE id > 1").show()
Joins:
df1.join(df2, on="id", how="inner").show()
3. Transformations
Basic Transformations:
from pyspark.sql.functions import count, avg
df.filter(df["col2"] > 10).select("column", "col2").show()
df.groupBy("column").agg(count("*").alias("count"), avg("col2")).show()
Window Functions:
from pyspark.sql.window import Window
from pyspark.sql.functions import rank
window = Window.partitionBy("category").orderBy("sales")
df.withColumn("rank", rank().over(window)).show()
4. PySpark Functions
Common Functions: col, lit, when/otherwise, concat, upper, lower, coalesce (all from pyspark.sql.functions).
Date Functions:
from pyspark.sql.functions import current_date, datediff
df = df.withColumn("today", current_date())
5. PySpark RDDs
Create and Transform RDDs:
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
mapped_rdd = rdd.map(lambda x: x * 2)
filtered_rdd = rdd.filter(lambda x: x > 2)
print(filtered_rdd.collect())
Actions:
print(rdd.count())
print(rdd.collect())
Transformations:
rdd1 = spark.sparkContext.parallelize([1, 2, 3])
rdd2 = spark.sparkContext.parallelize([2, 3, 4])
union_rdd = rdd1.union(rdd2)
intersection_rdd = rdd1.intersection(rdd2)
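A classic RDD word-count sketch built from these transformations plus flatMap and reduceByKey, assuming an existing SparkSession named spark:
lines = spark.sparkContext.parallelize(["spark makes big data simple", "big data with spark"])
counts = lines.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
print(counts.collect())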
6. PySpark Optimization
Repartition:
df_repartitioned = df.repartition(8)
Window Example:
from pyspark.sql.functions import row_number
window = Window.partitionBy("category").orderBy("sales")
df.withColumn("row_number", row_number().over(window)).show()
Aggregate Example:
df.groupBy("department").agg(
count("*").alias("count"),
avg("salary").alias("avg_salary")
).show()
Broadcast Joins:
from pyspark.sql.functions import broadcast
df_large.join(broadcast(df_small), "key").show()
UDFs:
def uppercase(name):
    return name.upper()
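A sketch of registering this function as a DataFrame UDF, assuming a DataFrame df with a name column (hypothetical name):
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
uppercase_udf = udf(uppercase, StringType())
df = df.withColumn("name_upper", uppercase_udf(df["name"]))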
Accumulators:
acc = spark.sparkContext.accumulator(0)
def add_to_acc(value):
acc.add(value)
rdd.foreach(add_to_acc)
print(acc.value)
Broadcast Joins:
• Definition: Optimizes join operations when one DataFrame is small enough to fit in memory; the small DataFrame is broadcast to every executor so no shuffle is needed.
• Syntax:
from pyspark.sql.functions import broadcast
result = large_df.join(broadcast(small_df), "key")
Window Functions:
• Usage: Perform operations like ranking, cumulative sums, etc., over a specific window of rows.
• Example:
from pyspark.sql.window import Window
from pyspark.sql.functions import rank
window_spec = Window.partitionBy("department").orderBy("salary")
ranked_df = employees.withColumn("rank", rank().over(window_spec))
ranked_df.show()
Repartition and Coalesce:
• repartition() performs a full shuffle and can increase or decrease the number of partitions; coalesce() only decreases it and avoids a full shuffle.
df_repartitioned = df.repartition(4)
df_coalesced = df.coalesce(2)
14. Accumulators
• Syntax:
acc = spark.sparkContext.accumulator(0)
rdd.foreach(lambda x: acc.add(1))
print(acc.value)
Caching:
• Keep a frequently reused DataFrame in memory to avoid recomputation.
df.cache()
• Salting: Add random prefixes to keys to distribute data evenly during joins.
• Example:
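A minimal salting sketch, assuming a skewed DataFrame df with a join key column named key (hypothetical); a random suffix spreads hot keys across partitions:
from pyspark.sql.functions import col, concat, lit, floor, rand
num_salts = 10
df_salted = df.withColumn("salted_key", concat(col("key").cast("string"), lit("_"), floor(rand() * num_salts).cast("string")))
Note that the smaller side of the join then needs one copy of each row per salt value so the salted keys still match.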
• Reading/Writing to Kafka:
df = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "localhost:9092").option("subscribe", "topic1").load()
df.writeStream.format("kafka").option("kafka.bootstrap.servers", "localhost:9092").option("topic", "topic2").option("checkpointLocation", "checkpoint_path").start()
• Avro (requires the external spark-avro package):
• df.write.format("avro").save("path")
• ORC:
df.write.format("orc").save("path")
Spark Configuration:
• Common Parameters: spark.sql.shuffle.partitions, spark.executor.memory, spark.executor.cores, spark.driver.memory, spark.sql.autoBroadcastJoinThreshold.
• Example:
spark.conf.set("spark.sql.shuffle.partitions", 50)
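A short sketch of setting these at session-creation time through the builder (values are illustrative):
from pyspark.sql import SparkSession
spark = (SparkSession.builder
         .appName("tuned_app")
         .config("spark.sql.shuffle.partitions", "50")
         .config("spark.executor.memory", "4g")
         .getOrCreate())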