Window Functions in PySpark DataFrames

At NeoMart, analysts often need more than simple aggregations.
They want rankings, running totals, moving averages, and session-level analytics. For example:

  • Rank products by revenue per category
  • Calculate each customer’s cumulative spend
  • Determine the last purchase date for loyalty scoring

This is where window functions come in.


Why Window Functions Matter

  • Allow calculations over partitions of data
  • Keep row-level detail while performing aggregations
  • Essential for ranking, cumulative metrics, and analytics features
  • Perfect for dashboards, reporting, and feature engineering in ML

Unlike groupBy, which collapses each group into a single row, window functions compute aggregates while retaining every input row.


1. Import Window Functions

from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, rank, dense_rank, sum, avg, desc

Note: importing sum and avg this way shadows Python's built-ins of the same name; many codebases instead use from pyspark.sql import functions as F to avoid that.

2. Ranking Rows: row_number()

window_spec = Window.partitionBy("category").orderBy(desc("revenue"))

df.withColumn("rank", row_number().over(window_spec)).show()

Story Example

NeoMart wants the top 3 products per category for promotion campaigns: assign a row number within each category, then filter on it.


3. Cumulative / Running Totals

window_spec = Window.partitionBy("customer_id").orderBy("date")

df.withColumn("running_total", sum("amount").over(window_spec)).show()

Because the window is ordered but has no explicit frame, Spark defaults to a frame from the first row of the partition up to the current row, so each row shows the customer's cumulative spend to date — essential for loyalty programs.


4. Moving Averages

# A 4-row frame: the current row plus the 3 preceding rows.
# Caution: without partitionBy, Spark moves all data to a single partition.
window_spec = Window.orderBy("date").rowsBetween(-3, 0)

df.withColumn("moving_avg", avg("amount").over(window_spec)).show()

Use Case

  • Smooth out daily sales volatility
  • Detect trends or anomalies

5. Ranking With Ties: rank() vs dense_rank()

window_spec = Window.partitionBy("category").orderBy(desc("revenue"))

df.withColumn("rank", rank().over(window_spec)).show()
df.withColumn("dense_rank", dense_rank().over(window_spec)).show()
  • rank() → ties share a rank and the next rank is skipped (1, 1, 3)
  • dense_rank() → ties share a rank with no gaps (1, 1, 2)

Useful for leaderboards or top-n product ranking.


Summary

Window functions in PySpark:

  • Operate over partitions without reducing rows
  • Enable rankings, cumulative sums, moving averages
  • Are essential for BI, ML features, and advanced analytics
  • Retain the flexibility of DataFrame operations

Next, we’ll cover Handling Missing Data — Drop, Fill, Replace, to clean and prepare datasets for analysis and modeling.