
Pivot & Unpivot in PySpark DataFrames — Reshape Data the Smart Way

At NeoMart, analysts ask questions like:

  • “Show total sales per category for each month.”
  • “Convert event names into columns to simplify dashboards.”
  • “Transform wide data into tall data for ML models.”

These tasks require reshaping data — turning rows into columns or columns back into rows.

PySpark offers two powerful tools:

  • pivot()
  • stack() (for unpivot)

Used well, they turn messy transactional data into analytics-friendly tables.


1. Sample Dataset

We’ll work with a tiny NeoMart sales dataset:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Create (or reuse) the SparkSession used throughout these examples
spark = SparkSession.builder.getOrCreate()

data = [
    ("2025-01", "Electronics", 12000),
    ("2025-01", "Grocery", 4000),
    ("2025-02", "Electronics", 15000),
    ("2025-02", "Grocery", 5000),
    ("2025-02", "Fashion", 3000)
]

df = spark.createDataFrame(data, ["month", "category", "revenue"])
df.show()

Input Data

month      category      revenue
2025-01    Electronics   12000
2025-01    Grocery       4000
2025-02    Electronics   15000
2025-02    Grocery       5000
2025-02    Fashion       3000

2. Pivot — Convert Rows to Columns

Goal: Turn categories into columns.

df_pivot = df.groupBy("month").pivot("category").sum("revenue")
df_pivot.show()

Output

month      Electronics   Fashion   Grocery
2025-01    12000         null      4000
2025-02    15000         3000      5000

Story

NeoMart’s dashboard team loves this pivoted table because each product category becomes a column — ready for BI tools.
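
Fashion had no January sales, so the pivot fills that cell with null. If dashboards expect zeros instead, a quick fill keeps the table tidy. A minimal sketch using the standard DataFrame null-handling API:

# Replace nulls produced by missing month/category combinations with 0
df_pivot_filled = df_pivot.na.fill(0)
df_pivot_filled.show()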


3. Pivot with Known Values (More Efficient)

Specifying the expected pivot values up front improves performance: otherwise Spark must run an extra job just to discover the distinct values of the pivot column before it can build the output schema.

categories = ["Electronics", "Fashion", "Grocery"]

df_pivot2 = df.groupBy("month") \
    .pivot("category", categories) \
    .sum("revenue")

df_pivot2.show()

Same output, faster execution.
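
If the category list is not known ahead of time, it can still be collected once and passed in explicitly. A small sketch, assuming the number of distinct categories is small enough to collect to the driver:

# Discover the distinct categories once, then reuse them as explicit pivot values
categories = [row["category"] for row in df.select("category").distinct().collect()]

df_pivot3 = df.groupBy("month").pivot("category", categories).sum("revenue")
df_pivot3.show()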


4. Pivot Using Spark SQL

SELECT *
FROM (
    SELECT month, category, revenue FROM sales
)
PIVOT (
    SUM(revenue) FOR category IN ('Electronics', 'Fashion', 'Grocery')
)
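
The query above assumes a table or view named sales. From a PySpark session, you can register the DataFrame as a temporary view and run the same statement, roughly like this:

# Expose the DataFrame to Spark SQL under the name used in the query above
df.createOrReplaceTempView("sales")

spark.sql("""
    SELECT *
    FROM (SELECT month, category, revenue FROM sales)
    PIVOT (SUM(revenue) FOR category IN ('Electronics', 'Fashion', 'Grocery'))
""").show()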

5. Unpivot — Convert Columns Back to Rows (Tall Format)

Older PySpark releases don't have a built-in unpivot() function (DataFrame.unpivot() and its alias melt() only arrived in Spark 3.4). Two reliable methods work on any version:

Method 1 — Using stack()

df_unpivot = df_pivot.select(
    "month",
    F.expr("stack(3, 'Electronics', Electronics, 'Fashion', Fashion, 'Grocery', Grocery) as (category, revenue)")
)

df_unpivot.show()

Output

month      category      revenue
2025-01    Electronics   12000
2025-01    Fashion       null
2025-01    Grocery       4000
2025-02    Electronics   15000
2025-02    Fashion       3000
2025-02    Grocery       5000

Story

NeoMart’s ML team prefers unpivoted tall format because algorithms expect features in rows — not scattered across columns.
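
Because the pivoted table had a null for Fashion in January, that null travels back into the tall format. If the ML pipeline should only see real observations, a simple filter removes it:

# Keep only month/category combinations that actually recorded revenue
df_unpivot_clean = df_unpivot.filter(F.col("revenue").isNotNull())
df_unpivot_clean.show()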


6. Generic Unpivot for Any DataFrame

def unpivot(df, by, cols):
    expr = "stack({0}, {1}) as (column, value)".format(
        len(cols),
        ", ".join(["'%s', %s" % (c, c) for c in cols])
    )
    return df.select(by + [F.expr(expr)])

Usage:

unpivot(df_pivot, ["month"], ["Electronics", "Fashion", "Grocery"]).show()
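
One caveat, stated as an assumption about your column names: stack() parses each column reference as a SQL identifier, so names containing spaces or dots should be backtick-quoted. A hypothetical variant of the helper that does this:

# Hypothetical variant: backtick-quote each column so unusual names parse correctly
def unpivot_quoted(df, by, cols):
    expr = "stack({0}, {1}) as (column, value)".format(
        len(cols),
        ", ".join(["'{0}', `{0}`".format(c) for c in cols])
    )
    return df.select(by + [F.expr(expr)])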

7. Unpivot Using Arrays + explode()

An alternative is to build one struct per column, collect the structs into an array, and explode the array back into rows. It is more verbose than stack(), but it keeps each category/revenue pair explicit and is easy to extend column by column.

df_unp2 = df_pivot.select(
    "month",
    F.explode(
        F.array(
            F.struct(F.lit("Electronics").alias("category"), F.col("Electronics").alias("revenue")),
            F.struct(F.lit("Fashion").alias("category"), F.col("Fashion").alias("revenue")),
            F.struct(F.lit("Grocery").alias("category"), F.col("Grocery").alias("revenue"))
        )
    ).alias("row")
).select("month", "row.*")

df_unp2.show()
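
If your cluster runs Spark 3.4 or later, the DataFrame API also includes a built-in unpivot() (with melt() as an alias), which can replace both hand-rolled approaches. A minimal sketch, assuming Spark 3.4+:

# Spark 3.4+ only: built-in unpivot keeps "month" as the identifier column
df_unp3 = df_pivot.unpivot(
    ids=["month"],
    values=["Electronics", "Fashion", "Grocery"],
    variableColumnName="category",
    valueColumnName="revenue"
)
df_unp3.show()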

8. Best Practices for Pivot & Unpivot

✔ Define known pivot values for performance
✔ Use Spark SQL for complex pivoting logic
✔ Beware of high-cardinality pivots (thousands of columns)
✔ Cache results when pivoting large DataFrames
✔ Use unpivot for ML preprocessing & analytics normalization
✔ Validate nulls introduced during pivot


Summary

You now understand how to:

  • Pivot: turn rows into columns
  • Unpivot: turn columns back into rows
  • Use SQL and DataFrame APIs
  • Handle missing categories
  • Optimize pivot performance at scale

NeoMart’s reporting, analytics, and ML pipelines rely heavily on these reshaping techniques — and now you can build the same power into your PySpark workflows.


Next Topic → Join Optimization Techniques — Broadcast, Skew Handling & Efficient Joins