PySpark Transformations

Can you explain the different transformations you've done in your project?

Be Prepared
Learn 50 PySpark Transformations to Stand Out
Abhishek Agrawal
Azure Data Engineer
1. Normalization
Scaling data to a range between 0 and 1.

from pyspark.ml.feature import MinMaxScaler


scaler = MinMaxScaler(inputCol="features", outputCol="scaled_features")
scaled_data = scaler.fit(data).transform(data)

2. Standardization
Transforming data to have zero mean and unit variance.

from pyspark.ml.feature import StandardScaler


# withMean=True centers the data to zero mean; withStd=True (the default) scales to unit variance
scaler = StandardScaler(inputCol="features", outputCol="scaled_features", withMean=True)
scaled_data = scaler.fit(data).transform(data)

3. Log Transformation
Applying a logarithmic transformation to handle skewed data.

from pyspark.sql.functions import log

# Apply the natural logarithm to the 'value' column to reduce skew
data = data.withColumn("log_value", log(data["value"]))

4. Binning
Grouping continuous values into discrete bins.

from pyspark.sql.functions import when

# Add a new column 'bin_column' based on conditions


data = data.withColumn(
    "bin_column",
    when(data["value"] < 10, "Low")
    .when(data["value"] < 20, "Medium")
    .otherwise("High")
)

5. One-Hot Encoding
Converting categorical variables into binary columns.

from pyspark.ml.feature import OneHotEncoder, StringIndexer

# Step 1: Indexing the categorical column


indexer = StringIndexer(inputCol="category", outputCol="category_index")
indexed_data = indexer.fit(data).transform(data)

# Step 2: One-hot encoding the indexed column


encoder = OneHotEncoder(inputCol="category_index", outputCol="category_onehot")
encoded_data = encoder.fit(indexed_data).transform(indexed_data)

6. Label Encoding
Converting categorical values into integer labels.

from pyspark.ml.feature import StringIndexer

# Step 1: Create a StringIndexer to index the 'category' column


indexer = StringIndexer(inputCol="category", outputCol="category_index")

# Step 2: Fit the indexer on the data and transform it


indexed_data = indexer.fit(data).transform(data)

7. Pivoting
Pivoting is the process of transforming long-format data (where each row
represents a single observation or record) into wide-format data (where
each column represents a different attribute or category). This
transformation is typically used when you want to turn a categorical
variable into columns and aggregate values accordingly.

# Pivoting the data to create a summary of sales by month for each ID


pivoted_data = data.groupBy("id") \
    .pivot("month") \
    .agg({"sales": "sum"})

8. Unpivoting
Unpivoting is the opposite of pivoting. It transforms wide-format data
(where each column represents a different category or attribute) into
long-format data (where each row represents a single observation). This
is useful when you want to turn column headers back into values.

# Unpivoting the data to convert columns into rows


unpivoted_data = data.selectExpr(
    "id",
    "stack(2, 'Jan', Jan, 'Feb', Feb) as (month, sales)"
)

9. Aggregation
Summarizing data by applying functions like sum(), avg(), etc.

# Aggregating data by category to compute the sum of values


aggregated_data = data.groupBy("category") \
    .agg({"value": "sum"})

10. Feature Extraction
Extracting useful features from raw data.

from pyspark.sql.functions import year, month, dayofmonth

# Add year, month, and day columns to the DataFrame


data = (
    data
    .withColumn("year", year(data["timestamp"]))
    .withColumn("month", month(data["timestamp"]))
    .withColumn("day", dayofmonth(data["timestamp"]))
)

11. Outlier Removal


Filtering out extreme values (outliers).

# Filter rows where the 'value' column is less than 1000


filtered_data = data.filter(data["value"] < 1000)

12. Data Imputation


Filling missing values with the mean or median.

from pyspark.ml.feature import Imputer

# Create an Imputer instance


imputer = Imputer(inputCols=["column"], outputCols=["imputed_column"])

# Fit the imputer model and transform the data


imputed_data = imputer.fit(data).transform(data)

13. Date/Time Parsing
Converting string to datetime objects.

from pyspark.sql.functions import to_timestamp

# Convert the 'date_string' column to a timestamp with the specified format


data = data.withColumn("timestamp", to_timestamp(data["date_string"], "yyyy-MM-dd"))

14. Text Transformation


Converting text to lowercase.

from pyspark.sql.functions import lower

# Convert the text in 'text_column' to lowercase and store it in a new column


data = data.withColumn("lowercase_text", lower(data["text_column"]))

15. Data Merging


Combining two datasets based on a common column.

# Perform an inner join between data1 and data2 on the 'id' column
merged_data = data1.join(data2, data1["id"] == data2["id"], "inner")

16. Data Joining


Joining data using inner, left, or right joins.

# Perform a left join between data1 and data2 on the 'id' column
joined_data = data1.join(data2, on="id", how="left")

17. Filtering Rows
Filtering rows based on a condition.

# Filter rows where the 'value' column is greater than 10


filtered_data = data.filter(data["value"] > 10)

18. Column Renaming


Renaming columns for clarity.

# Rename the column 'old_column' to 'new_column'


data = data.withColumnRenamed("old_column", "new_column")

19. Column Dropping


Removing unnecessary columns.

# Drop the 'unwanted_column' from the DataFrame


data = data.drop("unwanted_column")

20. Column Conversion


Converting a column from one data type to another.

from pyspark.sql.functions import col

# Convert 'column_string' to an integer and create a new column 'column_int'


data = data.withColumn("column_int", col("column_string").cast("int"))

21. Type Casting
Changing the type of a column (e.g., from string to integer).

# Convert 'column_string' to an integer and create a new column 'column_int'


data = data.withColumn("column_int", data["column_string"].cast("int"))

22. Duplicate Removal


Removing duplicate rows based on specified columns.

# Remove duplicate rows based on 'column1' and 'column2'


data = data.dropDuplicates(["column1", "column2"])

23. Null Value Removal


Filtering rows with null values in specified columns.

# Filter rows where the 'column' is not null


cleaned_data = data.filter(data["column"].isNotNull())

24. Windowing Functions


Using window functions to rank or aggregate data.

from pyspark.sql.window import Window


from pyspark.sql.functions import rank

# Define a window specification partitioned by 'category' and ordered by 'value'


window_spec = Window.partitionBy("category").orderBy("value")

# Add a 'rank' column based on the window specification


data = data.withColumn("rank", rank().over(window_spec))

25. Column Combination
Combining multiple columns into one.

from pyspark.sql.functions import concat

# Concatenate 'first_name' and 'last_name' columns to create 'full_name'


data = data.withColumn("full_name", concat(data["first_name"], data["last_name"])
)

26. Cumulative Sum


Calculating a running total of a column.

from pyspark.sql.window import Window


from pyspark.sql.functions import sum

# Define a window specification ordered by 'date' with an unbounded preceding frame
window_spec = Window.orderBy("date") \
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)

# Add a 'cumulative_sum' column that computes the cumulative sum of 'value'


data = data.withColumn("cumulative_sum", sum("value").over(window_spec))

27. Rolling Average


Calculating a moving average over a window of rows.

from pyspark.sql.window import Window


from pyspark.sql.functions import avg

window_spec = Window.orderBy("date").rowsBetween(-2, 2)

data = data.withColumn("rolling_avg", avg("value").over(window_spec))

28. Value Mapping
Mapping values of a column to new values.

from pyspark.sql.functions import when

# Map 'value' column: set 'mapped_column' to 'A' if 'value' is 1, otherwise 'B'


data = data.withColumn("mapped_column", when(data["value"] == 1, "A").
otherwise("B"))

29. Subsetting Columns


Selecting only a subset of columns from the dataset.
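
A minimal sketch, assuming the DataFrame has columns named 'column1' and 'column2':

# Keep only the columns needed downstream
subset_data = data.select("column1", "column2")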

30. Column Operations


Performing arithmetic operations on columns.

# Create a new column 'new_column' as the sum of 'value1' and 'value2'


data = data.withColumn("new_column", data["value1"] + data["value2"])

31. String Splitting


Splitting a string column into multiple columns based on a delimiter.

from pyspark.sql.functions import split

# Split the values in 'column' by a comma and store the result in 'split_column'
data = data.withColumn("split_column", split(data["column"], ","))

32. Data Flattening
Flattening nested structures (e.g., JSON) into a tabular format.

from pyspark.sql.functions import explode

# Flatten the array or map in 'nested_column' into multiple rows in 'flattened_column'
data = data.withColumn("flattened_column", explode(data["nested_column"]))

33. Sampling Data


Taking a random sample of the data.

# Sample 10% of the data


sampled_data = data.sample(fraction=0.1)

34. Stripping Whitespace


Removing leading and trailing whitespace from string columns.

from pyspark.sql.functions import trim

# Remove leading and trailing spaces from 'string_column' and create 'trimmed_column'
data = data.withColumn("trimmed_column", trim(data["string_column"]))

35. String Replacing
Replacing substrings within a string column.

from pyspark.sql.functions import regexp_replace

# Replace occurrences of 'old_value' with 'new_value' in 'text_column' and create 'updated_column'
data = data.withColumn(
    "updated_column",
    regexp_replace(data["text_column"], "old_value", "new_value")
)

36. Date Difference


Calculating the difference between two date columns.

from pyspark.sql.functions import datediff

# Calculate the difference in days between 'end_date' and 'start_date', and create 'date_diff'
data = data.withColumn("date_diff", datediff(data["end_date"], data["start_date"]))

37. Window Rank


Ranking rows based on a specific column.

from pyspark.sql.window import Window


from pyspark.sql.functions import rank

# Define a window specification ordered by 'value'


window_spec = Window.orderBy("value")

# Add a 'rank' column based on the window specification


data = data.withColumn("rank", rank().over(window_spec))

38. Multi-Column Aggregation
Performing multiple aggregation operations on different columns.

# Group by 'category' and calculate the sum of 'value1' and the average of 'value2'
aggregated_data = data.groupBy("category").agg(
    {"value1": "sum", "value2": "avg"}
)

39. Date Truncation


Truncating a date column to a specific unit (e.g., year, month).

from pyspark.sql.functions import trunc

# Truncate 'date_column' to the beginning of the month and add it as a new column
data = data.withColumn("truncated_date", trunc(data["date_column"], "MM"))

40. Repartitioning Data


Changing the number of partitions for better performance.

# Repartition the DataFrame into 4 partitions


data = data.repartition(4)

41. Adding Sequence Numbers
Assigning a unique sequence number to each row.

from pyspark.sql.functions import monotonically_increasing_id

# Add a new column 'row_id' with a unique, monotonically increasing ID


data = data.withColumn("row_id", monotonically_increasing_id())

42. Shuffling Data


Randomly shuffling rows in a dataset.

from pyspark.sql.functions import rand

# Shuffle the DataFrame by ordering rows randomly


shuffled_data = data.orderBy(rand())

43. Array Aggregation


Combining values into an array.

from pyspark.sql.functions import collect_list

# Group by 'id' and aggregate 'value' into a list, storing it in a new column 'values_array'
data = data.groupBy("id").agg(collect_list("value").alias("values_array"))

44. Scaling
Rescaling a continuous column. The example below uses QuantileDiscretizer to map values into quantile-based levels; a simple factor-based rescaling is sketched after it.

from pyspark.ml.feature import QuantileDiscretizer

# Initialize the QuantileDiscretizer with input column, output column, and number of buckets
scaler = QuantileDiscretizer(inputCol="value", outputCol="scaled_value", numBuckets=10)

# Fit the discretizer to the data and transform the DataFrame


scaled_data = scaler.fit(data).transform(data)
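
If scaling by a literal numeric factor is intended, a minimal sketch (the factor 0.1 is an assumed placeholder):

from pyspark.sql.functions import col

# Multiply 'value' by a fixed factor to rescale it
data = data.withColumn("scaled_value", col("value") * 0.1)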

45. Bucketing
Grouping continuous data into buckets.

from pyspark.ml.feature import Bucketizer

# Define split points for bucketing


splits = [0, 10, 20, 30, 40, 50]

# Initialize the Bucketizer with splits, input column, and output column
bucketizer = Bucketizer(splits=splits, inputCol="value", outputCol="bucketed_value")

# Apply the bucketizer transformation to the DataFrame


bucketed_data = bucketizer.transform(data)

46. Boolean Operations
Performing boolean operations on columns.

from pyspark.sql.functions import col

# Add a new column 'is_valid' indicating whether the 'value' column is greater than 10
data = data.withColumn("is_valid", col("value") > 10)

47. Extracting Substrings


Extracting a portion of a string from a column.

from pyspark.sql.functions import substring, col

# Add a new column 'substring' containing the first 5 characters of 'text_column'
data = data.withColumn("substring", substring(col("text_column"), 1, 5))

48. JSON Parsing


Parsing JSON data into structured columns.

from pyspark.sql.functions import from_json, col

# Parse the JSON data in 'json_column' into a structured column 'json_data' using the specified schema
data = data.withColumn("json_data", from_json(col("json_column"), schema))

49. String Length
Finding the length of a string column.

from pyspark.sql.functions import length, col

# Add a new column 'string_length' containing the length of the strings in 'text_column'
data = data.withColumn("string_length", length(col("text_column")))

50. Row-wise Operations


Applying row-wise functions to a dataset by applying a custom function
to a column using a User-Defined Function (UDF).

from pyspark.sql.functions import udf, col


from pyspark.sql.types import IntegerType

# Define a function to add 2 to the input value


def add_two(value):
    return value + 2

# Register the function as a UDF


add_two_udf = udf(add_two, IntegerType())

# Apply the UDF to the 'value' column and create a new column 'incremented_value'
data = data.withColumn("incremented_value", add_two_udf(col("value")))

Follow for more
content like this

Abhishek Agrawal
Azure Data Engineer
