PySpark Transformations

Can you explain the different transformations you've done in your project?

Be Prepared
Learn 50 PySpark Transformations to Stand Out
Abhishek Agrawal
Azure Data Engineer
1. Normalization
Scaling data to a range between 0 and 1.

from pyspark.ml.feature import MinMaxScaler


scaler = MinMaxScaler(inputCol="features", outputCol="scaled_features")
scaled_data = scaler.fit(data).transform(data)

2. Standardization
Transforming data to have zero mean and unit variance.

from pyspark.ml.feature import StandardScaler


# withMean=True centers the data to zero mean; withStd=True (the default) scales to unit variance
scaler = StandardScaler(inputCol="features", outputCol="scaled_features", withMean=True)
scaled_data = scaler.fit(data).transform(data)

3. Log Transformation
Applying a logarithmic transformation to handle skewed data.

from pyspark.sql.functions import log

# Apply the natural logarithm to the 'value' column to reduce skew
data = data.withColumn("log_value", log(data["value"]))

4. Binning
Grouping continuous values into discrete bins.

from pyspark.sql.functions import when

# Add a new column 'bin_column' based on conditions


data = data.withColumn(
    "bin_column",
    when(data["value"] < 10, "Low")
    .when(data["value"] < 20, "Medium")
    .otherwise("High")
)

5. One-Hot Encoding
Converting categorical variables into binary columns.

from pyspark.ml.feature import OneHotEncoder, StringIndexer

# Step 1: Indexing the categorical column


indexer = StringIndexer(inputCol="category", outputCol="category_index")
indexed_data = indexer.fit(data).transform(data)

# Step 2: One-hot encoding the indexed column


encoder = OneHotEncoder(inputCol="category_index", outputCol="category_onehot")
encoded_data = encoder.fit(indexed_data).transform(indexed_data)

6. Label Encoding
Converting categorical values into integer labels.

from pyspark.ml.feature import StringIndexer

# Step 1: Create a StringIndexer to index the 'category' column


indexer = StringIndexer(inputCol="category", outputCol="category_index")

# Step 2: Fit the indexer on the data and transform it


indexed_data = indexer.fit(data).transform(data)

7. Pivoting
Pivoting is the process of transforming long-format data (where each row
represents a single observation or record) into wide-format data (where
each column represents a different attribute or category). This
transformation is typically used when you want to turn a categorical
variable into columns and aggregate values accordingly.

# Pivoting the data to create a summary of sales by month for each ID


pivoted_data = data.groupBy("id") \
    .pivot("month") \
    .agg({"sales": "sum"})

8. Unpivoting
Unpivoting is the opposite of pivoting. It transforms wide-format data
(where each column represents a different category or attribute) into
long-format data (where each row represents a single observation). This
is useful when you want to turn column headers back into values.

# Unpivoting the data to convert columns into rows


unpivoted_data = data.selectExpr(
    "id",
    "stack(2, 'Jan', Jan, 'Feb', Feb) as (month, sales)"
)

9. Aggregation
Summarizing data by applying functions like sum(), avg(), etc.

# Aggregating data by category to compute the sum of values


aggregated_data = data.groupBy("category") \
    .agg({"value": "sum"})

10. Feature Extraction
Extracting useful features from raw data.

from pyspark.sql.functions import year, month, dayofmonth

# Add year, month, and day columns to the DataFrame


data = (
    data
    .withColumn("year", year(data["timestamp"]))
    .withColumn("month", month(data["timestamp"]))
    .withColumn("day", dayofmonth(data["timestamp"]))
)

11. Outlier Removal


Filtering out extreme values (outliers).

# Filter rows where the 'value' column is less than 1000


filtered_data = data.filter(data["value"] < 1000)

12. Data Imputation


Filling missing values with the mean or median.

from pyspark.ml.feature import Imputer

# Create an Imputer instance


imputer = Imputer(inputCols=["column"], outputCols=["imputed_column"])

# Fit the imputer model and transform the data


imputed_data = imputer.fit(data).transform(data)

13. Date/Time Parsing
Converting string to datetime objects.

from pyspark.sql.functions import to_timestamp

# Convert the 'date_string' column to a timestamp with the specified format


data = data.withColumn("timestamp", to_timestamp(data["date_string"], "yyyy-MM-dd"))

14. Text Transformation


Converting text to lowercase.

from pyspark.sql.functions import lower

# Convert the text in 'text_column' to lowercase and store it in a new column


data = data.withColumn("lowercase_text", lower(data["text_column"]))

15. Data Merging


Combining two datasets based on a common column.

# Perform an inner join between data1 and data2 on the 'id' column
merged_data = data1.join(data2, data1["id"] == data2["id"], "inner")

16. Data Joining


Joining data using inner, left, or right joins.

# Perform a left join between data1 and data2 on the 'id' column
joined_data = data1.join(data2, on="id", how="left")

17. Filtering Rows
Filtering rows based on a condition.

# Filter rows where the 'value' column is greater than 10


filtered_data = data.filter(data["value"] > 10)

18. Column Renaming


Renaming columns for clarity.

# Rename the column 'old_column' to 'new_column'


data = data.withColumnRenamed("old_column", "new_column")

19. Column Dropping


Removing unnecessary columns.

# Drop the 'unwanted_column' from the DataFrame


data = data.drop("unwanted_column")

20. Column Conversion


Converting a column from one data type to another.

from pyspark.sql.functions import col

# Convert 'column_string' to an integer and create a new column 'column_int'


data = data.withColumn("column_int", col("column_string").cast("int"))

21. Type Casting
Changing the type of a column (e.g., from string to integer).

# Convert 'column_string' to an integer and create a new column 'column_int'


data = data.withColumn("column_int", data["column_string"].cast("int"))

22. Duplicate Removal


Removing duplicate rows based on specified columns.

# Remove duplicate rows based on 'column1' and 'column2'


data = data.dropDuplicates(["column1", "column2"])

23. Null Value Removal


Filtering rows with null values in specified columns.

# Filter rows where the 'column' is not null


cleaned_data = data.filter(data["column"].isNotNull())

24. Windowing Functions


Using window functions to rank or aggregate data.

from pyspark.sql.window import Window


from pyspark.sql.functions import rank

# Define a window specification partitioned by 'category' and ordered by 'value'


window_spec = Window.partitionBy("category").orderBy("value")

# Add a 'rank' column based on the window specification


data = data.withColumn("rank", rank().over(window_spec))

25. Column Combination
Combining multiple columns into one.

from pyspark.sql.functions import concat

# Concatenate 'first_name' and 'last_name' columns to create 'full_name'


data = data.withColumn("full_name", concat(data["first_name"], data["last_name"])
)

26. Cumulative Sum


Calculating a running total of a column.

from pyspark.sql.window import Window


from pyspark.sql.functions import sum

# Define a window specification ordered by 'date' with an unbounded preceding frame
window_spec = Window.orderBy("date") \
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)

# Add a 'cumulative_sum' column that computes the cumulative sum of 'value'


data = data.withColumn("cumulative_sum", sum("value").over(window_spec))

27. Rolling Average


Calculating a moving average over a window of rows.

from pyspark.sql.window import Window


from pyspark.sql.functions import avg

window_spec = Window.orderBy("date").rowsBetween(-2, 2)

data = data.withColumn("rolling_avg", avg("value").over(window_spec))

28. Value Mapping
Mapping values of a column to new values.

from pyspark.sql.functions import when

# Map 'value' column: set 'mapped_column' to 'A' if 'value' is 1, otherwise 'B'


data = data.withColumn("mapped_column", when(data["value"] == 1, "A").
otherwise("B"))

29. Subsetting Columns


Selecting only a subset of columns from the dataset.
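
A minimal sketch, assuming the DataFrame has columns named 'column1' and 'column2':

# Keep only the columns needed downstream
subset_data = data.select("column1", "column2")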

30. Column Operations


Performing arithmetic operations on columns.

# Create a new column 'new_column' as the sum of 'value1' and 'value2'


data = data.withColumn("new_column", data["value1"] + data["value2"])

31. String Splitting


Splitting a string column into multiple columns based on a delimiter.

from pyspark.sql.functions import split

# Split the values in 'column' by a comma and store the result in 'split_column'
data = data.withColumn("split_column", split(data["column"], ","))

32. Data Flattening
Flattening nested structures (e.g., JSON) into a tabular format.

from pyspark.sql.functions import explode

# Flatten the array or map in 'nested_column' into multiple rows in 'flattened_column'
data = data.withColumn("flattened_column", explode(data["nested_column"]))

33. Sampling Data


Taking a random sample of the data.

# Sample 10% of the data


sampled_data = data.sample(fraction=0.1)

34. Stripping Whitespace


Removing leading and trailing whitespace from string columns.

from pyspark.sql.functions import trim

# Remove leading and trailing spaces from 'string_column' and create 'trimmed_column'
data = data.withColumn("trimmed_column", trim(data["string_column"]))

35. String Replacing
Replacing substrings within a string column.

from pyspark.sql.functions import regexp_replace

# Replace occurrences of 'old_value' with 'new_value' in 'text_column' and create 'updated_column'
data = data.withColumn(
    "updated_column",
    regexp_replace(data["text_column"], "old_value", "new_value")
)

36. Date Difference


Calculating the difference between two date columns.

from pyspark.sql.functions import datediff

# Calculate the difference in days between 'end_date' and 'start_date', and create 'date_diff'
data = data.withColumn("date_diff", datediff(data["end_date"], data["start_date"]))

37. Window Rank


Ranking rows based on a specific column.

from pyspark.sql.window import Window


from pyspark.sql.functions import rank

# Define a window specification ordered by 'value'


window_spec = Window.orderBy("value")

# Add a 'rank' column based on the window specification


data = data.withColumn("rank", rank().over(window_spec))

38. Multi-Column Aggregation
Performing multiple aggregation operations on different columns.

# Group by 'category' and calculate the sum of 'value1' and the average of 'value2'
aggregated_data = data.groupBy("category").agg(
    {"value1": "sum", "value2": "avg"}
)

39. Date Truncation


Truncating a date column to a specific unit (e.g., year, month).

from pyspark.sql.functions import trunc

# Truncate 'date_column' to the beginning of the month and add it as a new column
data = data.withColumn("truncated_date", trunc(data["date_column"], "MM"))

40. Repartitioning Data


Changing the number of partitions for better performance.

# Repartition the DataFrame into 4 partitions


data = data.repartition(4)

41. Adding Sequence Numbers
Assigning a unique sequence number to each row.

from pyspark.sql.functions import monotonically_increasing_id

# Add a new column 'row_id' with a unique, monotonically increasing ID


data = data.withColumn("row_id", monotonically_increasing_id())

42. Shuffling Data


Randomly shuffling rows in a dataset.

from pyspark.sql.functions import rand

# Shuffle the DataFrame by ordering rows randomly


shuffled_data = data.orderBy(rand())

43. Array Aggregation


Combining values into an array.

from pyspark.sql.functions import collect_list

# Group by 'id' and aggregate 'value' into a list, storing it in a new column 'values_array'
data = data.groupBy("id").agg(collect_list("value").alias("values_array"))

44. Scaling
Rescaling a continuous column. The example below uses QuantileDiscretizer to map values into quantile-based levels; a simple factor-based rescaling is sketched after it.

from pyspark.ml.feature import QuantileDiscretizer

# Initialize the QuantileDiscretizer with input column, output column, and number of buckets
scaler = QuantileDiscretizer(inputCol="value", outputCol="scaled_value", numBuckets=10)

# Fit the discretizer to the data and transform the DataFrame


scaled_data = scaler.fit(data).transform(data)
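
If scaling by a literal numeric factor is intended, a minimal sketch (the factor 0.1 is an assumed placeholder):

from pyspark.sql.functions import col

# Multiply 'value' by a fixed factor to rescale it
data = data.withColumn("scaled_value", col("value") * 0.1)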

45. Bucketing
Grouping continuous data into buckets.

from pyspark.ml.feature import Bucketizer

# Define split points for bucketing


splits = [0, 10, 20, 30, 40, 50]

# Initialize the Bucketizer with splits, input column, and output column
bucketizer = Bucketizer(splits=splits, inputCol="value", outputCol="bucketed_value")

# Apply the bucketizer transformation to the DataFrame


bucketed_data = bucketizer.transform(data)

46. Boolean Operations
Performing boolean operations on columns.

from pyspark.sql.functions import col

# Add a new column 'is_valid' indicating whether the 'value' column is greater than 10
data = data.withColumn("is_valid", col("value") > 10)

47. Extracting Substrings


Extracting a portion of a string from a column.

from pyspark.sql.functions import substring, col

# Add a new column 'substring' containing the first 5 characters of 'text_column'
data = data.withColumn("substring", substring(col("text_column"), 1, 5))

48. JSON Parsing


Parsing JSON data into structured columns.

from pyspark.sql.functions import from_json, col

# Parse the JSON data in 'json_column' into a structured column 'json_data' using the specified schema
data = data.withColumn("json_data", from_json(col("json_column"), schema))

49. String Length
Finding the length of a string column.

from pyspark.sql.functions import length, col

# Add a new column 'string_length' containing the length of the strings in 'text_column'
data = data.withColumn("string_length", length(col("text_column")))

50. Row-wise Operations


Applying row-wise functions to a dataset by applying a custom function
to a column using a User-Defined Function (UDF).

from pyspark.sql.functions import udf, col


from pyspark.sql.types import IntegerType

# Define a function to add 2 to the input value


def add_two(value):
    return value + 2

# Register the function as a UDF


add_two_udf = udf(add_two, IntegerType())

# Apply the UDF to the 'value' column and create a new column 'incremented_value'
data = data.withColumn("incremented_value", add_two_udf(col("value")))

Follow for more
content like this

Abhishek Agrawal
Azure Data Engineer
