Pivot & Unpivot in PySpark DataFrames — Reshape Data the Smart Way
At NeoMart, analysts ask questions like:
- “Show total sales per category for each month.”
- “Convert event names into columns to simplify dashboards.”
- “Transform wide data into tall data for ML models.”
These tasks require reshaping data — turning rows into columns or columns back into rows.
PySpark offers two powerful tools:
✔ pivot()
✔ stack() (for unpivot)
Used well, they turn messy transactional data into analytics-friendly tables.
1. Sample Dataset
We’ll work with a tiny NeoMart sales dataset:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start (or reuse) a SparkSession; the examples below assume it exists as `spark`
spark = SparkSession.builder.appName("neomart-reshape").getOrCreate()

data = [
    ("2025-01", "Electronics", 12000),
    ("2025-01", "Grocery", 4000),
    ("2025-02", "Electronics", 15000),
    ("2025-02", "Grocery", 5000),
    ("2025-02", "Fashion", 3000),
]
df = spark.createDataFrame(data, ["month", "category", "revenue"])
df.show()
Input Data
| month | category | revenue |
|---|---|---|
| 2025-01 | Electronics | 12000 |
| 2025-01 | Grocery | 4000 |
| 2025-02 | Electronics | 15000 |
| 2025-02 | Grocery | 5000 |
| 2025-02 | Fashion | 3000 |
2. Pivot — Convert Rows to Columns
Goal: Turn categories into columns.
df_pivot = df.groupBy("month").pivot("category").sum("revenue")
df_pivot.show()
Output
| month | Electronics | Fashion | Grocery |
|---|---|---|---|
| 2025-01 | 12000 | null | 4000 |
| 2025-02 | 15000 | 3000 | 5000 |
Story
NeoMart’s dashboard team loves this pivoted table because each product category becomes a column — ready for BI tools.
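pivot() also composes with agg() when a single aggregate per category isn't enough. A short sketch, assuming the same df; Spark names the resulting columns after the pivot value plus the aggregate alias, e.g. Electronics_total:
# One revenue total and one order count per category, per month
df_pivot_multi = df.groupBy("month").pivot("category").agg(
    F.sum("revenue").alias("total"),
    F.count("revenue").alias("orders")
)
df_pivot_multi.show()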
3. Pivot with Known Values (More Efficient)
Specifying the expected pivot values improves performance: without them, Spark runs an extra job just to scan the data and discover the distinct values of the pivot column.
categories = ["Electronics", "Fashion", "Grocery"]
df_pivot2 = df.groupBy("month") \
    .pivot("category", categories) \
    .sum("revenue")
df_pivot2.show()
Same output, faster execution.
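If the categories aren't known ahead of time but you still want the explicit-values form, you can derive them once with a distinct scan, paying that cost deliberately in one place. A minimal sketch (df_pivot3 is just an illustrative name):
# One pass to collect the distinct pivot values, then pivot with them
categories = [row["category"] for row in df.select("category").distinct().collect()]
df_pivot3 = df.groupBy("month").pivot("category", categories).sum("revenue")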
4. Pivot Using Spark SQL
SELECT *
FROM (
    SELECT month, category, revenue FROM sales
)
PIVOT (
    SUM(revenue) FOR category IN ('Electronics', 'Fashion', 'Grocery')
)
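To run this from PySpark, the DataFrame must first be visible to Spark SQL as a table. A small sketch, assuming the df built above:
# Expose the DataFrame under the name the query expects
df.createOrReplaceTempView("sales")

df_pivot_sql = spark.sql("""
    SELECT *
    FROM (SELECT month, category, revenue FROM sales)
    PIVOT (SUM(revenue) FOR category IN ('Electronics', 'Fashion', 'Grocery'))
""")
df_pivot_sql.show()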
5. Unpivot — Convert Columns Back to Rows (Tall Format)
PySpark had no built-in unpivot() until Spark 3.4, which added DataFrame.unpivot() (alias melt()).
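If you're on 3.4 or later, a quick sketch against the pivoted table from above:
# Spark 3.4+ only: ids stay as-is, each values column becomes a row
df_unpivot_builtin = df_pivot.unpivot(
    ids=["month"],
    values=["Electronics", "Fashion", "Grocery"],
    variableColumnName="category",
    valueColumnName="revenue",
)
df_unpivot_builtin.show()
On older versions, two reliable methods work everywhere: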
Method 1 — Using stack()
df_unpivot = df_pivot.select(
    "month",
    F.expr("stack(3, 'Electronics', Electronics, 'Fashion', Fashion, 'Grocery', Grocery) as (category, revenue)")
)
df_unpivot.show()
Output
| month | category | revenue |
|---|---|---|
| 2025-01 | Electronics | 12000 |
| 2025-01 | Fashion | null |
| 2025-01 | Grocery | 4000 |
| 2025-02 | Electronics | 15000 |
| 2025-02 | Fashion | 3000 |
| 2025-02 | Grocery | 5000 |
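The Fashion/null row for 2025-01 is an artifact of the earlier pivot, not real data. If such rows shouldn't survive the round trip, filter them out; a one-line sketch:
df_unpivot_clean = df_unpivot.filter(F.col("revenue").isNotNull())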
Story
NeoMart’s ML team prefers unpivoted tall format because algorithms expect features in rows — not scattered across columns.
6. Generic Unpivot for Any DataFrame
def unpivot(df, by, cols):
    # Build a stack() expression with one ('label', `column`) pair per column.
    # Backticks protect column names containing spaces or special characters.
    # Note: all cols must share a compatible type, or stack() will fail to resolve.
    expr = "stack({0}, {1}) as (column, value)".format(
        len(cols),
        ", ".join(["'{0}', `{0}`".format(c) for c in cols])
    )
    return df.select(by + [F.expr(expr)])
Usage:
unpivot(df_pivot, ["month"], ["Electronics", "Fashion", "Grocery"]).show()
7. Unpivot Using Arrays + explode()
More verbose but flexible: the structs are plain Python, so this scales to any number of columns (a generic version follows the example).
df_unp2 = df_pivot.select(
    "month",
    F.explode(
        F.array(
            F.struct(F.lit("Electronics").alias("category"), F.col("Electronics").alias("revenue")),
            F.struct(F.lit("Fashion").alias("category"), F.col("Fashion").alias("revenue")),
            F.struct(F.lit("Grocery").alias("category"), F.col("Grocery").alias("revenue"))
        )
    ).alias("row")
).select("month", "row.*")
df_unp2.show()
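Because the struct list is just Python, you can build it with a comprehension instead of writing each struct by hand. A sketch, with cols holding the pivoted column names:
cols = ["Electronics", "Fashion", "Grocery"]
df_unp3 = df_pivot.select(
    "month",
    # Generate one struct per column, then explode the array into rows
    F.explode(F.array(*[
        F.struct(F.lit(c).alias("category"), F.col(c).alias("revenue"))
        for c in cols
    ])).alias("row")
).select("month", "row.*")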
8. Best Practices for Pivot & Unpivot
✔ Define known pivot values for performance
✔ Use Spark SQL for complex pivoting logic
✔ Beware of high-cardinality pivots (thousands of columns)
✔ Cache results when pivoting large DataFrames
✔ Use unpivot for ML preprocessing & analytics normalization
✔ Validate nulls introduced during pivot (see the sketch below)
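For that last point, counting nulls per pivoted column is a quick sanity check. A minimal sketch against df_pivot:
# Count the nulls each pivoted column introduced (skip the grouping key)
null_counts = df_pivot.select([
    F.sum(F.col(c).isNull().cast("int")).alias(c)
    for c in df_pivot.columns
    if c != "month"
])
null_counts.show()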
Summary
You now understand how to:
- Pivot: turn rows into columns
- Unpivot: turn columns back into rows
- Use SQL and DataFrame APIs
- Handle missing categories
- Optimize pivot performance at scale
NeoMart’s reporting, analytics, and ML pipelines rely heavily on these reshaping techniques — and now you can build the same power into your PySpark workflows.
Next Topic → Join Optimization Techniques — Broadcast, Skew Handling & Efficient Joins