PySpark Cheatsheet

1. PySpark Overview

• Definition: PySpark is the Python API for Apache Spark, an open-source, distributed computing framework.

• Core Components:

o RDD (Resilient Distributed Dataset): Immutable distributed collections of objects.

o DataFrame: Distributed table with named columns; optimized for SQL queries.

o Dataset: Strongly typed, distributed data structure (available in Scala/Java).

• Languages Supported: Python, Scala, Java, and R.
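
A minimal sketch contrasting the RDD and DataFrame abstractions from Python (assumes an existing SparkSession named spark, as created in the Basics section further down):

rdd = spark.sparkContext.parallelize([(1, "Alice"), (2, "Bob")])  # RDD: plain Python objects, no schema
df = spark.createDataFrame(rdd, ["id", "name"])                   # DataFrame: named columns, optimized by Catalyst
df.show()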

2. Core Spark Concepts

• Driver: Manages the execution of tasks across the cluster.

• Executor: Performs computations and stores data on worker nodes.

• Partition: Logical division of data for parallel processing.

• Transformations: Create a new RDD/DataFrame from an existing one (e.g., map, filter).

• Actions: Trigger execution of transformations and return results (e.g., count, collect).
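
A minimal sketch of the distinction (assumes a SparkSession named spark): the transformations only build a lineage; nothing executes until the action is called.

rdd = spark.sparkContext.parallelize(range(10))
doubled = rdd.map(lambda x: x * 2)             # transformation: recorded lazily
evens = doubled.filter(lambda x: x % 4 == 0)   # transformation: still no computation
print(evens.count())                           # action: triggers execution and returns 5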

3. PySpark Architecture

• Cluster Manager:

o YARN, Mesos, or Standalone cluster.

• Execution Process:

1. A job is submitted by the Driver.

2. The job is divided into Stages at shuffle boundaries, and each Stage is split into Tasks.

3. Tasks run on Executors.
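
The cluster manager and executor resources above are fixed when the SparkSession is built. A minimal sketch, with local[4] standing in for a real YARN/Mesos/Standalone master URL and illustrative resource values:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("ArchDemo") \
    .master("local[4]") \
    .config("spark.executor.memory", "2g") \
    .config("spark.executor.cores", "2") \
    .getOrCreate()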


4. Common PySpark Operations

• Transformations:

o map: Applies a function to each element.

o filter: Filters elements based on a condition.

o groupBy: Groups data by a key.

o join: Joins two DataFrames based on a condition.

• Actions:

o show: Displays DataFrame.

o collect: Brings data to the driver.

o count: Counts the number of elements.
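
A short sketch of the operations above, assuming illustrative DataFrames df (columns key, amount) and other_df (column key):

mapped = df.rdd.map(lambda row: row["amount"] * 2)   # map: RDD-level transformation
filtered = df.filter(df["amount"] > 100)             # filter rows
grouped = df.groupBy("key").count()                  # groupBy a key
joined = df.join(other_df, "key")                    # join on a shared column

joined.show()                                        # action: display rows
print(filtered.count())                              # action: count rows
rows = grouped.collect()                             # action: bring results to the driver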

5. PySpark SQL

• Creating a Temporary View:

df.createOrReplaceTempView("table_name")
spark.sql("SELECT * FROM table_name").show()

• Common SQL Functions:

o agg: Perform aggregations.

o alias: Rename columns.

o distinct: Remove duplicates.
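
A minimal sketch of these functions, with illustrative column names country and sales:

from pyspark.sql.functions import sum, avg

df.select("country").distinct().show()       # distinct: unique values
df.groupBy("country").agg(
    sum("sales").alias("total_sales"),       # agg with alias
    avg("sales").alias("avg_sales")
).show()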

6. Window Functions

• Definition: Perform operations over a window of rows.

• Types:

o Ranking: row_number, rank, dense_rank.

o Aggregations: sum, avg, max, min.

• Example:

from pyspark.sql.window import Window
from pyspark.sql.functions import rank

window_spec = Window.partitionBy("col1").orderBy("col2")
df.withColumn("rank", rank().over(window_spec))

7. DataFrame API vs. SQL API

• DataFrame API:

o Pythonic syntax.

o Example: df.filter(df["col3"] > 10).select("col1", "col2")

• SQL API:

o SQL-like syntax.

o Example: spark.sql("SELECT col1, col2 FROM table WHERE col3 > 10")

8. Persisting and Caching

• Caching: Stores data in memory for faster reuse.

df.cache()

• Persistence: Allows control over storage levels (e.g., MEMORY_AND_DISK).

from pyspark import StorageLevel
df.persist(StorageLevel.DISK_ONLY)

9. Joins in PySpark

• Types of Joins:

o Inner, Left, Right, Full Outer, Semi, Anti (passed as a string; see the sketch below).

• Broadcast Join: Optimized join when one DataFrame is small.

from pyspark.sql.functions import broadcast
df = large_df.join(broadcast(small_df), "key")
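
A short sketch of passing the join type as the third argument, with illustrative DataFrames df1 and df2 sharing a key column:

df1.join(df2, "key", "left")        # left outer join
df1.join(df2, "key", "full_outer")  # full outer join
df1.join(df2, "key", "left_semi")   # rows of df1 that have a match in df2
df1.join(df2, "key", "left_anti")   # rows of df1 with no match in df2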

10. File Formats

• Supported Formats: CSV, JSON, Parquet, Avro, ORC.

• Reading Data:

df = spark.read.format("csv").option("header", True).load("path")

• Writing Data:

df.write.format("parquet").save("path")

11. Performance Optimization

• Repartitioning: Adjust the number of partitions for parallelism.

df = df.repartition(10)

• Coalesce: Reduce the number of partitions without a shuffle.

df = df.coalesce(1)

• Predicate Pushdown: Filters data early in the query execution.
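
A minimal sketch of predicate pushdown against a Parquet dataset at an illustrative path; applying the filter directly on the read lets Spark push it into the file scan, visible as PushedFilters in the physical plan:

df = spark.read.parquet("path/to/parquet")
df.filter(df["col3"] > 10).explain()   # the plan lists the pushed-down filter on col3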

12. Streaming with PySpark

• Reading Streams:

df = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "localhost:9092").option("subscribe", "topic1").load()

• Writing Streams:

query = df.writeStream.format("console").start()
query.awaitTermination()

13. Error Handling

• Common Exceptions:

o AnalysisException: Invalid query or missing columns.

o Py4JJavaError: Java exception in Spark operations.

• Debugging:

o Use .explain() to understand the query execution plan.

o Check Spark logs for detailed error messages.
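
A short sketch of both debugging aids, reusing table_name and df from the earlier examples (the missing column is deliberate):

from pyspark.sql.utils import AnalysisException

df.explain()                                               # prints the physical execution plan

try:
    spark.sql("SELECT missing_col FROM table_name").show()
except AnalysisException as e:
    print("Query failed:", e)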

14. Common PySpark Interview Questions

• What are the differences between RDD, DataFrame, and Dataset?


• How does Spark handle fault tolerance?

• Explain the concept of lazy evaluation in PySpark.

• How do you optimize joins in PySpark?

• Explain Spark's execution process (job, stages, and tasks).

1. PySpark Basics

Initialize SparkSession:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("AppName").getOrCreate()

Create DataFrame:

data = [(1, "Alice"), (2, "Bob")]

columns = ["id", "name"]

df = spark.createDataFrame(data, columns)

Inspect DataFrame:

df.show() # Display rows

df.printSchema() # Show schema

df.describe().show() # Summary statistics

Read/Write Data:

# Read

df = spark.read.csv("file_path", header=True, inferSchema=True)

# Write

df.write.csv("output_path", header=True)

2. PySpark SQL

SQL Queries:

df.createOrReplaceTempView("table")
spark.sql("SELECT * FROM table WHERE id > 1").show()

Joins:

df1.join(df2, df1["key"] == df2["key"], "inner").show() # Types: inner, left, right, outer

3. Transformations

Basic Transformations:

df.select("column1", "column2").show() # Select columns

df.filter(df["column"] > 10).show() # Filter rows

df.withColumn("new_col", df["col"] * 2).show() # Add column

GroupBy and Aggregations:

from pyspark.sql.functions import count, avg, sum

df.groupBy("column").agg(count("*").alias("count"), avg("col2")).show()

Window Functions:

from pyspark.sql.window import Window

from pyspark.sql.functions import rank

window = Window.partitionBy("category").orderBy("sales")

df.withColumn("rank", rank().over(window)).show()

4. PySpark Functions

Common Functions:

from pyspark.sql.functions import col, lit, concat, when

df = df.withColumn("new_col", concat(col("col1"), lit("_"), col("col2"))) # Concatenate

df = df.withColumn("status", when(df["col"] > 10, "High").otherwise("Low")) #


Conditional

Date Functions:
from pyspark.sql.functions import current_date, datediff

df = df.withColumn("today", current_date())

df = df.withColumn("days_diff", datediff(df["date_col"], df["today"]))

5. PySpark RDD Operations

Basic RDD Operations:

rdd = spark.sparkContext.parallelize([1, 2, 3, 4])

mapped_rdd = rdd.map(lambda x: x * 2)

filtered_rdd = mapped_rdd.filter(lambda x: x > 4)

print(filtered_rdd.collect())

Actions:

print(rdd.count())

print(rdd.collect())

Transformations:

rdd1 = spark.sparkContext.parallelize([1, 2, 3])

rdd2 = spark.sparkContext.parallelize([3, 4, 5])

union_rdd = rdd1.union(rdd2)

intersection_rdd = rdd1.intersection(rdd2)

6. PySpark Optimization

Persist and Cache:

df.cache() # Cache in memory

df.persist() # Persist in memory and disk

df.unpersist() # Remove from cache

Repartition:

df = df.repartition(4) # Increase partitions

df = df.coalesce(2) # Decrease partitions


7. PySpark Interview Patterns

1. Self Join Example:

df.alias("df1").join(df.alias("df2"), col("df1.id") == col("df2.supervisor"), "inner").show()

2. Window Function Example:

from pyspark.sql.functions import row_number

window = Window.partitionBy("category").orderBy("sales")

df.withColumn("row_number", row_number().over(window)).show()

3. Aggregate Example:

df.groupBy("department").agg(

count("*").alias("count"),

avg("salary").alias("avg_salary")

).show()

8. PySpark Advanced Topics

Broadcast Joins:

from pyspark.sql.functions import broadcast

df_large.join(broadcast(df_small), "key").show()

UDF (User-Defined Functions):

from pyspark.sql.functions import udf

from pyspark.sql.types import StringType

def uppercase(name):
    return name.upper()

uppercase_udf = udf(uppercase, StringType())


df.withColumn("uppercase_name", uppercase_udf(df["name"])).show()

Accumulators:

acc = spark.sparkContext.accumulator(0)

def add_to_acc(value):
    acc.add(value)

rdd.foreach(add_to_acc)

print(acc.value)

11. Broadcast Joins

• Definition: Optimizes join operations when one DataFrame is small enough to fit
in memory.

• Syntax:

from pyspark.sql.functions import broadcast

result = large_df.join(broadcast(small_df), "key")

• Use Case: Useful for improving performance by avoiding shuffle operations.

12. Window Functions

• Usage: Perform operations like ranking, cumulative sums, etc., over a specific
window of rows.

• Example:

from pyspark.sql.window import Window

from pyspark.sql.functions import rank, col

window_spec = Window.partitionBy("department").orderBy("salary")
ranked_df = employees.withColumn("rank", rank().over(window_spec))

ranked_df.show()

• Common Functions: row_number, rank, dense_rank, lag, lead, ntile.

13. Data Partitioning

• Repartitioning: Changes the number of partitions.

df_repartitioned = df.repartition(4)

• Coalesce: Reduces the number of partitions without shuffling.

df_coalesced = df.coalesce(2)

14. Accumulators

• Definition: Variables used to perform aggregations.

• Syntax:

acc = spark.sparkContext.accumulator(0)

rdd.foreach(lambda x: acc.add(1))

print(acc.value)

15. Caching and Persistence

• Caching: Stores RDD/DataFrame in memory for reuse.

df.cache()

• Persistence: Allows specifying storage levels (e.g., memory, disk).


from pyspark import StorageLevel
df.persist(StorageLevel.MEMORY_AND_DISK)

16. Skew Handling

• Salting: Add random prefixes to keys to distribute data evenly during joins.

• Example (the salt is generated with rand; the other side of the join must be expanded with every salt value):

from pyspark.sql.functions import concat, col, lit, rand, floor

df_with_salt = df.withColumn("salted_key", concat(col("key"), lit("_"), floor(rand() * 10).cast("string")))

17. Fault Tolerance

• RDD Lineage: RDDs keep track of transformations for automatic recovery in case of failure (see the lineage sketch after this list).

• Action Retry: Automatically retries failed tasks.
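
A minimal sketch that prints an RDD's lineage; toDebugString returns bytes in PySpark:

rdd = spark.sparkContext.parallelize(range(100)).map(lambda x: x * 2).filter(lambda x: x > 50)
print(rdd.toDebugString().decode("utf-8"))   # the chain of parent RDDs Spark can replay after a failure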

18. Integration with Other Tools

• Integration with Hive (the SparkSession must be built with enableHiveSupport()):

spark.sql("SELECT * FROM hive_table")

• Reading/Writing to Kafka:

df = spark.readStream.format("kafka").option("kafka.bootstrap.servers",
"localhost:9092").option("subscribe", "topic1").load()

df.writeStream.format("kafka").option("kafka.bootstrap.servers",
"localhost:9092").option("topic", "topic2").start()

19. Advanced File Formats

• Avro (requires the external spark-avro package):

df.write.format("avro").save("path")

• ORC:
df.write.format("orc").save("path")

20. Performance Tuning

• Common Parameters:

o spark.sql.shuffle.partitions: Adjust for better parallelism.

o spark.executor.memory: Increase memory for executors.

o spark.executor.cores: Set the number of cores per executor.

• Example:

spark.conf.set("spark.sql.shuffle.partitions", 50)
