PySpark Transformations
What different transformations have you done in your project?
Be prepared: learn 50 PySpark transformations to stand out.
Abhishek Agrawal
Azure Data Engineer
1. Normalization
Scaling data to a range between 0 and 1.
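For example, a minimal min-max scaling sketch, assuming a DataFrame data with a numeric 'value' column (column names are illustrative):
from pyspark.sql import functions as F

# Compute the column's min and max, then rescale 'value' into the 0-1 range
stats = data.agg(F.min("value").alias("min_v"), F.max("value").alias("max_v")).first()
data = data.withColumn(
    "value_normalized",
    (F.col("value") - stats["min_v"]) / (stats["max_v"] - stats["min_v"])
)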
2. Standardization
Transforming data to have zero mean and unit variance.
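A minimal sketch using the column's mean and standard deviation, again assuming a numeric 'value' column:
from pyspark.sql import functions as F

# Subtract the mean and divide by the standard deviation of 'value'
stats = data.agg(F.mean("value").alias("mu"), F.stddev("value").alias("sigma")).first()
data = data.withColumn("value_standardized", (F.col("value") - stats["mu"]) / stats["sigma"])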
3. Log Transformation
Applying a logarithmic transformation to handle skewed data.
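For instance, assuming a non-negative numeric 'value' column, log1p (log of 1 + x) avoids issues with zeros:
from pyspark.sql import functions as F

# Apply log(1 + x) to compress the right tail of 'value'
data = data.withColumn("value_log", F.log1p(F.col("value")))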
5. One-Hot Encoding
Converting categorical variables into binary columns.
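A sketch with pyspark.ml, assuming a string 'category' column (names are illustrative):
from pyspark.ml.feature import StringIndexer, OneHotEncoder

# Index the categorical column, then expand the index into a binary (one-hot) vector
indexer = StringIndexer(inputCol="category", outputCol="category_index")
encoder = OneHotEncoder(inputCols=["category_index"], outputCols=["category_onehot"])
data = indexer.fit(data).transform(data)
data = encoder.fit(data).transform(data)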
6. Label Encoding
Converting categorical values into integer labels.
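StringIndexer covers this directly; a sketch assuming the same 'category' column:
from pyspark.ml.feature import StringIndexer

# Map each distinct category to an integer label (the most frequent category gets 0)
indexer = StringIndexer(inputCol="category", outputCol="category_label")
data = indexer.fit(data).transform(data)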
8. Unpivoting
Unpivoting is the opposite of pivoting. It transforms wide-format data
(where each column represents a different category or attribute) into
long-format data (where each row represents a single observation). This
is useful when you want to turn column headers back into values.
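A minimal sketch using stack(), assuming wide columns 'q1' and 'q2' that should become rows:
# Turn the 'q1' and 'q2' columns into (quarter, value) rows, keeping 'id'
unpivoted = data.selectExpr(
    "id",
    "stack(2, 'q1', q1, 'q2', q2) as (quarter, value)"
)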
9. Aggregation
Summarizing data by applying functions like sum(), avg(), etc.
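For example, assuming a 'category' column and a numeric 'amount' column:
from pyspark.sql import functions as F

# Total and average 'amount' per category
summary = data.groupBy("category").agg(
    F.sum("amount").alias("total_amount"),
    F.avg("amount").alias("avg_amount")
)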
# Perform an inner join between data1 and data2 on the 'id' column
merged_data = data1.join(data2, data1["id"] == data2["id"], "inner")
# Perform a left join between data1 and data2 on the 'id' column
joined_data = data1.join(data2, on="id", how="left")
from pyspark.sql.window import Window

# Define a window over rows ordered by 'date', spanning 2 rows before and 2 rows after the current row
window_spec = Window.orderBy("date").rowsBetween(-2, 2)
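One way this window might be used, assuming a numeric 'value' column, is a centered moving average:
from pyspark.sql import functions as F

# Average 'value' over the 5-row window centered on each row
data = data.withColumn("moving_avg", F.avg("value").over(window_spec))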
from pyspark.sql.functions import split

# Split the values in 'column' by a comma and store the result in 'split_column'
data = data.withColumn("split_column", split(data["column"], ","))
# Group by 'category' and calculate the sum of 'value1' and the average of 'value2'
aggregated_data = data.groupBy("category").agg(
    {"value1": "sum", "value2": "avg"}
)
from pyspark.sql.functions import trunc

# Truncate 'date_column' to the first day of its month and add it as a new column
data = data.withColumn("truncated_date", trunc(data["date_column"], "MM"))
from pyspark.sql.functions import collect_list

# Group by 'id' and aggregate 'value' into a list, storing it in a new column 'values_array'
data = data.groupBy("id").agg(collect_list("value").alias("values_array"))
45. Bucketing
Grouping continuous data into buckets.
from pyspark.ml.feature import Bucketizer

# Example bucket boundaries (illustrative): [0, 10), [10, 20), [20, inf)
splits = [0.0, 10.0, 20.0, float("inf")]

# Initialize the Bucketizer with splits, input column, and output column, then apply it
bucketizer = Bucketizer(splits=splits, inputCol="value", outputCol="bucketed_value")
data = bucketizer.transform(data)
from pyspark.sql.functions import col

# Add a new column 'is_valid' indicating whether the 'value' column is greater than 10
data = data.withColumn("is_valid", col("value") > 10)
from pyspark.sql.functions import from_json, col

# Parse the JSON string in 'json_column' into a structured column 'json_data' using a predefined schema
data = data.withColumn("json_data", from_json(col("json_column"), schema))
from pyspark.sql.functions import udf, col
from pyspark.sql.types import IntegerType

# Example UDF (assumed for illustration): adds 2 to an integer value
add_two_udf = udf(lambda x: x + 2, IntegerType())

# Apply the UDF to the 'value' column and create a new column 'incremented_value'
data = data.withColumn("incremented_value", add_two_udf(col("value")))