
Pivot & Unpivot in PySpark DataFrames — Reshape Data the Smart Way

At NeoMart, analysts ask questions like:

  • “Show total sales per category for each month.”
  • “Convert event names into columns to simplify dashboards.”
  • “Transform wide data into tall data for ML models.”

These tasks require reshaping data — turning rows into columns or columns back into rows.

PySpark offers two powerful tools:

  • pivot()
  • stack() (for unpivot)

Used well, they turn messy transactional data into analytics-friendly tables.


1. Sample Dataset

We’ll work with a tiny NeoMart sales dataset:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Create (or reuse) the SparkSession used throughout these examples
spark = SparkSession.builder.getOrCreate()

data = [
    ("2025-01", "Electronics", 12000),
    ("2025-01", "Grocery", 4000),
    ("2025-02", "Electronics", 15000),
    ("2025-02", "Grocery", 5000),
    ("2025-02", "Fashion", 3000)
]

df = spark.createDataFrame(data, ["month", "category", "revenue"])
df.show()

Input Data

month      category      revenue
2025-01    Electronics   12000
2025-01    Grocery       4000
2025-02    Electronics   15000
2025-02    Grocery       5000
2025-02    Fashion       3000

2. Pivot — Convert Rows to Columns

Goal: Turn categories into columns.

df_pivot = df.groupBy("month").pivot("category").sum("revenue")
df_pivot.show()

Output

month      Electronics   Fashion   Grocery
2025-01    12000         null      4000
2025-02    15000         3000      5000

Story

NeoMart’s dashboard team loves this pivoted table because each product category becomes a column — ready for BI tools.
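
Fashion had no January sales, so the pivot fills that cell with null. If dashboards expect zeros instead, a quick fill keeps the table tidy. A minimal sketch using the standard DataFrame null-handling API:

# Replace nulls produced by missing month/category combinations with 0
df_pivot_filled = df_pivot.na.fill(0)
df_pivot_filled.show()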


3. Pivot with Known Values (More Efficient)

Specifying the expected pivot values up front improves performance: otherwise Spark must run an extra job just to discover the distinct values of the pivot column before it can build the output schema.

categories = ["Electronics", "Fashion", "Grocery"]

df_pivot2 = df.groupBy("month") \
    .pivot("category", categories) \
    .sum("revenue")

df_pivot2.show()

Same output, faster execution.
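
If the category list is not known ahead of time, it can still be collected once and passed in explicitly. A small sketch, assuming the number of distinct categories is small enough to collect to the driver:

# Discover the distinct categories once, then reuse them as explicit pivot values
categories = [row["category"] for row in df.select("category").distinct().collect()]

df_pivot3 = df.groupBy("month").pivot("category", categories).sum("revenue")
df_pivot3.show()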


4. Pivot Using Spark SQL

SELECT *
FROM (
    SELECT month, category, revenue FROM sales
)
PIVOT (
    SUM(revenue) FOR category IN ('Electronics', 'Fashion', 'Grocery')
)
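
The query above assumes a table or view named sales. From a PySpark session, you can register the DataFrame as a temporary view and run the same statement, roughly like this:

# Expose the DataFrame to Spark SQL under the name used in the query above
df.createOrReplaceTempView("sales")

spark.sql("""
    SELECT *
    FROM (SELECT month, category, revenue FROM sales)
    PIVOT (SUM(revenue) FOR category IN ('Electronics', 'Fashion', 'Grocery'))
""").show()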

5. Unpivot — Convert Columns Back to Rows (Tall Format)

Older PySpark releases don't have a built-in unpivot() function (DataFrame.unpivot() and its alias melt() only arrived in Spark 3.4). Two reliable methods work on any version:

Method 1 — Using stack()

df_unpivot = df_pivot.select(
    "month",
    F.expr("stack(3, 'Electronics', Electronics, 'Fashion', Fashion, 'Grocery', Grocery) as (category, revenue)")
)

df_unpivot.show()

Output

month      category      revenue
2025-01    Electronics   12000
2025-01    Fashion       null
2025-01    Grocery       4000
2025-02    Electronics   15000
2025-02    Fashion       3000
2025-02    Grocery       5000

Story

NeoMart’s ML team prefers unpivoted tall format because algorithms expect features in rows — not scattered across columns.
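
Because the pivoted table had a null for Fashion in January, that null travels back into the tall format. If the ML pipeline should only see real observations, a simple filter removes it:

# Keep only month/category combinations that actually recorded revenue
df_unpivot_clean = df_unpivot.filter(F.col("revenue").isNotNull())
df_unpivot_clean.show()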


6. Generic Unpivot for Any DataFrame

def unpivot(df, by, cols):
    expr = "stack({0}, {1}) as (column, value)".format(
        len(cols),
        ", ".join(["'%s', %s" % (c, c) for c in cols])
    )
    return df.select(by + [F.expr(expr)])

Usage:

unpivot(df_pivot, ["month"], ["Electronics", "Fashion", "Grocery"]).show()
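
One caveat, stated as an assumption about your column names: stack() parses each column reference as a SQL identifier, so names containing spaces or dots should be backtick-quoted. A hypothetical variant of the helper that does this:

# Hypothetical variant: backtick-quote each column so unusual names parse correctly
def unpivot_quoted(df, by, cols):
    expr = "stack({0}, {1}) as (column, value)".format(
        len(cols),
        ", ".join(["'{0}', `{0}`".format(c) for c in cols])
    )
    return df.select(by + [F.expr(expr)])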

7. Unpivot Using Arrays + explode()

An alternative is to build one struct per column, collect the structs into an array, and explode the array back into rows. It is more verbose than stack(), but it keeps each category/revenue pair explicit and is easy to extend column by column.

df_unp2 = df_pivot.select(
    "month",
    F.explode(
        F.array(
            F.struct(F.lit("Electronics").alias("category"), F.col("Electronics").alias("revenue")),
            F.struct(F.lit("Fashion").alias("category"), F.col("Fashion").alias("revenue")),
            F.struct(F.lit("Grocery").alias("category"), F.col("Grocery").alias("revenue"))
        )
    ).alias("row")
).select("month", "row.*")

df_unp2.show()
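
If your cluster runs Spark 3.4 or later, the DataFrame API also includes a built-in unpivot() (with melt() as an alias), which can replace both hand-rolled approaches. A minimal sketch, assuming Spark 3.4+:

# Spark 3.4+ only: built-in unpivot keeps "month" as the identifier column
df_unp3 = df_pivot.unpivot(
    ids=["month"],
    values=["Electronics", "Fashion", "Grocery"],
    variableColumnName="category",
    valueColumnName="revenue"
)
df_unp3.show()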

8. Best Practices for Pivot & Unpivot

✔ Define known pivot values for performance
✔ Use Spark SQL for complex pivoting logic
✔ Beware of high-cardinality pivots (thousands of columns)
✔ Cache results when pivoting large DataFrames
✔ Use unpivot for ML preprocessing & analytics normalization
✔ Validate nulls introduced during pivot


Summary

You now understand how to:

  • Pivot: turn rows into columns
  • Unpivot: turn columns back into rows
  • Use SQL and DataFrame APIs
  • Handle missing categories
  • Optimize pivot performance at scale

NeoMart’s reporting, analytics, and ML pipelines rely heavily on these reshaping techniques — and now you can build the same power into your PySpark workflows.


Next Topic → Join Optimization Techniques — Broadcast, Skew Handling & Efficient Joins