Window Functions in Spark
Window functions operate on a group of rows, called a window, and return a value for every row. Unlike standard
aggregate functions, they retain the individual rows while still producing values computed across the window.
This makes them useful for calculations over a specified range of data, such as rankings, running totals, and moving averages.
Creating a DataFrame:
Let's start by creating a DataFrame with a sample dataset spanning two years, 2019 and 2020.
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, desc
from pyspark.sql.window import Window

# Create a SparkSession (reuses the active session if one already exists)
spark = SparkSession.builder.getOrCreate()

data = [
("2019", "Hamilton", 413),
("2019", "Bottas", 326),
("2019", "Verstappen", 278),
("2019", "Vettel", 240),
("2020", "Hamilton", 347),
("2020", "Bottas", 223),
("2020", "Verstappen", 214),
("2020", "Vettel", 33),
]
# Creating DataFrame
columns = ["RaceYear", "DriverName", "TotalPoints"]
df = spark.createDataFrame(data, columns)
df.show()
```
Ranking with rank():
```python
from pyspark.sql.functions import rank
```
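rank() assigns a rank to each row within its window partition; tied values share a rank and leave a gap after them. As a minimal sketch using the DataFrame created above, the following ranks drivers by total points within each season:
```python
# Rank drivers within each season by total points, highest first
windowSpec = Window.partitionBy("RaceYear").orderBy(desc("TotalPoints"))
df.withColumn("Rank", rank().over(windowSpec)).show()
```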
Cumulative sums with sum():
```python
from pyspark.sql.functions import sum
```
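Applying sum() over an ordered window produces a running total (note that this import shadows Python's built-in sum). As a sketch on the same DataFrame, the cumulative points within each season, accumulated from the highest scorer down:
```python
# Running total of points within each season
windowSpec = (
    Window.partitionBy("RaceYear")
    .orderBy(desc("TotalPoints"))
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
)
df.withColumn("CumulativePoints", sum("TotalPoints").over(windowSpec)).show()
```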
Accessing neighbouring rows with lag() and lead():
```python
from pyspark.sql.functions import lag, lead
```
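lag() and lead() return a value from a preceding or following row in the window, which is useful for row-to-row comparisons. A sketch that, for each driver, pulls in the points of the drivers placed immediately above and below them in the same season (the new column names are illustrative):
```python
# Points of the neighbouring drivers in the season standings
windowSpec = Window.partitionBy("RaceYear").orderBy(desc("TotalPoints"))
df.withColumn("PointsAbove", lag("TotalPoints").over(windowSpec)) \
  .withColumn("PointsBelow", lead("TotalPoints").over(windowSpec)) \
  .show()
```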
Relative ranking with percent_rank():
```python
from pyspark.sql.functions import percent_rank
```
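percent_rank() expresses each row's rank as a value between 0 and 1 relative to its partition. A sketch on the same DataFrame:
```python
# Relative standing of each driver within the season (0.0 = top of the partition)
windowSpec = Window.partitionBy("RaceYear").orderBy(desc("TotalPoints"))
df.withColumn("PercentRank", percent_rank().over(windowSpec)).show()
```
Window frames also make moving averages straightforward. The next example assumes a different, hypothetical DataFrame with StockSymbol, Date, and StockPrice columns; rowsBetween(-4, 0) limits the frame to the current row plus the four preceding rows, giving a five-row moving average per stock: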
```python
from pyspark.sql.functions import avg
# Frame = current row and the 4 rows before it, per stock symbol
windowSpec = Window.partitionBy("StockSymbol").orderBy("Date").rowsBetween(-4, 0)
df.withColumn("MovingAvg", avg("StockPrice").over(windowSpec)).show()
```
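dense_rank() behaves like rank() but leaves no gaps after ties. The example below assumes a sales DataFrame with Category and TotalSales columns, ranking rows within each category by sales: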
```python
from pyspark.sql.functions import dense_rank
# Rank within each category by total sales, highest first; ties share a rank with no gaps
windowSpec = Window.partitionBy("Category").orderBy(desc("TotalSales"))
df.withColumn("Rank", dense_rank().over(windowSpec)).show()
```
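The same frame idea works with any aggregate. Assuming a call-log DataFrame with CustomerID, CallDate, and CallDuration columns, this computes a rolling average call duration over a five-call window per customer: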
```python
# Average duration over a sliding window of 5 calls per customer
windowSpec = Window.partitionBy("CustomerID").orderBy(desc("CallDate")).rowsBetween(-4, 0)
df.withColumn("AvgCallDuration", avg("CallDuration").over(windowSpec)).show()
```
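row_number() assigns a unique, consecutive number to every row in the partition, even when values tie. Assuming an employee DataFrame with Department and PerformanceScore columns: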
```python
from pyspark.sql.functions import row_number
# Number employees within each department from highest to lowest score
windowSpec = Window.partitionBy("Department").orderBy(desc("PerformanceScore"))
df.withColumn("Rank", row_number().over(windowSpec)).show()
```
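lag() is also handy for period-over-period calculations. Assuming a sales DataFrame with ProductID, SalesDate, and SalesAmount columns, this derives a growth rate by comparing each row with the previous period's sales: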
```python
from pyspark.sql.functions import lag, col
windowSpec = Window.partitionBy("ProductID").orderBy("SalesDate")
# Bring the previous period's sales amount onto each row
df = df.withColumn("PreviousSales", lag("SalesAmount").over(windowSpec))
# Growth relative to the previous period (null for each product's first row)
df = df.withColumn("GrowthRate", (col("SalesAmount") - col("PreviousSales")) / col("PreviousSales"))
df.show()
```
### Conclusion
Window functions in Apache Spark let you combine row-level detail with calculations over a defined window of rows. By partitioning
and ordering data, you can compute rankings, running totals, moving averages, and row-to-row comparisons such as growth rates,
all of which are common requirements in data processing tasks.