
# Window Functions in Apache Spark

Window functions operate on a set of rows and return a value for every row. Unlike standard aggregation
functions, which collapse each group to a single row, window functions retain the original rows while still
producing aggregated values. This makes them useful for running calculations across a specified range, or
window, of data.

### Key Components of Window Functions

- Partition By: defines how to split the data into partitions.
- Order By: defines the order of rows within each partition.
- Window Specification: defines the frame of rows the function operates over (see the sketch below).
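
To make these components concrete, here is a minimal, self-contained sketch (the `sales` DataFrame, its
columns, and the app name are invented for illustration). It contrasts an ordinary `groupBy`, which collapses
rows, with the same aggregate over a window, which keeps them, and then adds an ordering and an explicit frame:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import sum as _sum
from pyspark.sql.window import Window

spark = SparkSession.builder.master("local").appName("WindowBasics").getOrCreate()

# Hypothetical sample data: one row per sale
sales = spark.createDataFrame(
    [("EU", 10), ("EU", 20), ("US", 5)], ["Region", "Amount"]
)

# groupBy collapses each region to a single row
sales.groupBy("Region").agg(_sum("Amount").alias("RegionTotal")).show()

# The same total over a window keeps every original row
totalSpec = Window.partitionBy("Region")
sales.withColumn("RegionTotal", _sum("Amount").over(totalSpec)).show()

# Adding Order By and an explicit frame: the current row plus the one before it
rollingSpec = Window.partitionBy("Region").orderBy("Amount").rowsBetween(-1, 0)
sales.withColumn("RollingSum", _sum("Amount").over(rollingSpec)).show()
```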

### Creating a DataFrame

Let's start by creating a DataFrame with a sample dataset spanning two years, 2019 and 2020.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import desc
from pyspark.sql.window import Window

# Initialize Spark session
spark = SparkSession.builder.master("local").appName("WindowFunctions").getOrCreate()

# Sample data: (RaceYear, DriverName, TotalPoints)
data = [
    ("2019", "Hamilton", 413),
    ("2019", "Bottas", 326),
    ("2019", "Verstappen", 278),
    ("2019", "Vettel", 240),
    ("2020", "Hamilton", 347),
    ("2020", "Bottas", 223),
    ("2020", "Verstappen", 214),
    ("2020", "Vettel", 33),
]

# Create the DataFrame
columns = ["RaceYear", "DriverName", "TotalPoints"]
df = spark.createDataFrame(data, columns)
df.show()
```

### Applying Window Functions

1. Ranking Drivers by Total Points

```python
from pyspark.sql.functions import rank

# Define the window: one partition per season, ordered by points descending
windowSpec = Window.partitionBy("RaceYear").orderBy(desc("TotalPoints"))

# Apply the rank function
df.withColumn("Rank", rank().over(windowSpec)).show()
```
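
In this dataset every driver's points total is unique, so `rank()` simply returns 1 through 4 within each
season, with Hamilton first in both. When rows tie on the ordering column, `rank()` gives them the same rank
and skips the following ones (1, 2, 2, 4), whereas `dense_rank()` leaves no gaps (1, 2, 2, 3).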

2. Calculating Cumulative Sum

```python
# Alias the import so it does not shadow Python's built-in sum()
from pyspark.sql.functions import sum as _sum

# Apply the cumulative sum over the same window
df.withColumn("CumulativePoints", _sum("TotalPoints").over(windowSpec)).show()
```
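
Because `windowSpec` orders each season by points descending, the running total accumulates from the
championship leader downward. Note also that when a window has an `orderBy` but no explicit frame, Spark
defaults to `rangeBetween(Window.unboundedPreceding, Window.currentRow)`; to accumulate strictly one row at a
time (which only differs when rows tie on the ordering column), spell the frame out. A sketch:

```python
# Explicit row-based frame; equivalent here because all point totals are unique
runningSpec = (
    Window.partitionBy("RaceYear")
    .orderBy(desc("TotalPoints"))
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
)
df.withColumn("CumulativePoints", _sum("TotalPoints").over(runningSpec)).show()
```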

3. Using Lag and Lead Functions

```python
from pyspark.sql.functions import lag, lead

# lag: the previous row's points within the window
df.withColumn("PreviousPoints", lag("TotalPoints", 1).over(windowSpec)).show()

# lead: the next row's points within the window
df.withColumn("NextPoints", lead("TotalPoints", 1).over(windowSpec)).show()
```
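
The first row of each partition has no predecessor, so `lag()` returns null there, and `lead()` likewise
returns null on the last row. Both accept an optional third argument used as a default in place of null:

```python
# Fall back to 0 at the partition edges instead of null
df.withColumn("PreviousPoints", lag("TotalPoints", 1, 0).over(windowSpec)).show()
```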

4. Percent Rank Function

```python
from pyspark.sql.functions import percent_rank

# Apply the percent rank function
df.withColumn("PercentRank", percent_rank().over(windowSpec)).show()
```
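
`percent_rank()` computes `(rank - 1) / (rows in partition - 1)`, so values run from 0.0 for the first row to
1.0 for the last. With four drivers per season, the results here are 0.0, 0.333..., 0.667..., and 1.0.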

### Real-world Applications of Window Functions

Each snippet below assumes a DataFrame `df` with the columns named in the code; the examples illustrate the
window pattern rather than a complete pipeline.

1. **Financial Industry** - Calculating Moving Average

```python
from pyspark.sql.functions import avg

# Five-row moving average: the current row plus the four preceding rows
windowSpec = Window.partitionBy("StockSymbol").orderBy("Date").rowsBetween(-4, 0)
df.withColumn("MovingAvg", avg("StockPrice").over(windowSpec)).show()
```
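
`rowsBetween(-4, 0)` can equivalently be written `rowsBetween(-4, Window.currentRow)`. At the start of each
partition fewer than five rows exist, so the average is simply taken over however many rows are available.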

2. **Retail Industry** - Ranking Products by Sales

```python
from pyspark.sql.functions import dense_rank, desc

# dense_rank assigns consecutive ranks with no gaps after ties
windowSpec = Window.partitionBy("Category").orderBy(desc("TotalSales"))
df.withColumn("Rank", dense_rank().over(windowSpec)).show()
```

3. **Healthcare Industry** - Running Total of Patients

```python
from pyspark.sql.functions import sum as _sum

# Running total of patients over the whole dataset, ordered by admission date
windowSpec = Window.orderBy("AdmissionDate").rowsBetween(Window.unboundedPreceding, Window.currentRow)
df.withColumn("RunningTotalPatients", _sum("Patients").over(windowSpec)).show()
```
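
Because this window has no `partitionBy`, Spark moves all rows into a single partition to evaluate it and
logs a performance warning; on large data, partition by an additional key. A sketch with a hypothetical
`HospitalID` column:

```python
# Hypothetical: one running total per hospital keeps the computation parallel
windowSpec = (
    Window.partitionBy("HospitalID")
    .orderBy("AdmissionDate")
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
)
```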

4. **Telecommunications Industry** - Churn Prediction

```python
from pyspark.sql.functions import avg, desc

# Average duration of the current call and up to four more recent calls per customer
windowSpec = Window.partitionBy("CustomerID").orderBy(desc("CallDate")).rowsBetween(-4, 0)
df.withColumn("AvgCallDuration", avg("CallDuration").over(windowSpec)).show()
```

5. **Human Resources** - Employee Performance Analysis

```python
from pyspark.sql.functions import row_number, desc

# Unique rank per employee within each department, highest score first
windowSpec = Window.partitionBy("Department").orderBy(desc("PerformanceScore"))
df.withColumn("Rank", row_number().over(windowSpec)).show()
```
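
Because `row_number()` breaks ties arbitrarily, every employee gets a unique rank, which makes it the usual
choice for top-N selection. A sketch with the same hypothetical columns:

```python
from pyspark.sql.functions import col

# Keep only the top three performers in each department
df.withColumn("Rank", row_number().over(windowSpec)).filter(col("Rank") <= 3).show()
```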

6. **Sales and Marketing** - Calculating Sales Growth Rate

```python
from pyspark.sql.functions import lag, col

windowSpec = Window.partitionBy("ProductID").orderBy("SalesDate")

# Previous period's sales for each product
df = df.withColumn("PreviousSales", lag("SalesAmount").over(windowSpec))
# Period-over-period growth rate (null for each product's first period)
df = df.withColumn("GrowthRate", (col("SalesAmount") - col("PreviousSales")) / col("PreviousSales"))

df.show()
```
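
`GrowthRate` is null for each product's first period, since there is no previous value to compare against.
If `PreviousSales` can be zero, note that depending on the `spark.sql.ansi.enabled` setting, division by zero
either yields null (ANSI off) or raises an error (ANSI on); multiply by 100 if you want the rate as a percentage.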

### Conclusion
Window functions in Apache Spark provide powerful capabilities for complex data analysis. By partitioning and
ordering data, you can perform calculations such as ranking, cumulative sums, and lag/lead comparisons while
retaining every original row, which is crucial for many data processing tasks.
