
Partitioning & Bucketing in PySpark — Optimize Large Dataset Performance

At NeoMart, the data team works with billions of rows every day:

  • Sales records by date and region
  • Clickstream events by user and session
  • Product catalogs across multiple categories

Reading all data every time is slow and expensive.
Partitioning and bucketing let Spark skip irrelevant data and reduce shuffle during joins and aggregations.


1. Understanding Partitioning

Partitioning physically splits data into directories based on column values.

  • Each partition = a subdirectory
  • Example: partition by year or region
  • Spark can prune partitions during queries

Example Dataset

from pyspark.sql import SparkSession

# Local session for the examples below
spark = SparkSession.builder.appName("NeoMartPartitioning").getOrCreate()

data = [
    ("2025-01-01", "Electronics", 12000),
    ("2025-01-01", "Grocery", 4000),
    ("2025-02-01", "Electronics", 15000),
    ("2025-02-01", "Fashion", 3000),
]

df = spark.createDataFrame(data, ["date", "category", "revenue"])

2. Writing Partitioned Data

df.write.mode("overwrite") \
    .partitionBy("category") \
    .parquet("/tmp/neo_partitioned")

  • Creates a folder for each category:
    /tmp/neo_partitioned/category=Electronics/
    /tmp/neo_partitioned/category=Grocery/

  • Querying category='Electronics' reads only that partition.


3. Reading Partitioned Data

# basePath keeps the partition column (category) in the result
df_read = spark.read.option("basePath", "/tmp/neo_partitioned") \
    .parquet("/tmp/neo_partitioned/category=Electronics")
df_read.show()

Output

date        category     revenue
2025-01-01  Electronics  12000
2025-02-01  Electronics  15000

Partition pruning reduces I/O and speeds up queries.
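
To confirm that pruning actually happens, filter on the partition column and inspect the physical plan. A minimal sketch, assuming the partitioned path written above; the file scan node of the plan should list the predicate under PartitionFilters.

# Read the full dataset and filter on the partition column;
# Spark prunes non-matching directories at planning time
pruned = spark.read.parquet("/tmp/neo_partitioned") \
    .filter("category = 'Electronics'")

# The scan in the physical plan should show the predicate under
# PartitionFilters, meaning only category=Electronics/ files are read
pruned.explain()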


4. Partitioning Best Practices

✔ Partition by low-to-moderate cardinality columns that queries filter on (like date, region)
✔ Avoid partitioning by too many columns → too many small files
✔ Keep fewer than ~1000 partitions per table for efficiency
✔ Combine with predicate filtering on the partition column so pruning kicks in
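
A minimal sketch of the first two points, reusing the df built earlier: derive coarse year and month columns from date and partition by those instead of the raw date string (the output path here is just an example).

from pyspark.sql import functions as F

# Derive coarse time columns so each partition holds a month of data
# instead of one directory per distinct date
orders = df.withColumn("dt", F.to_date("date")) \
    .withColumn("year", F.year("dt")) \
    .withColumn("month", F.month("dt"))

orders.write.mode("overwrite") \
    .partitionBy("year", "month") \
    .parquet("/tmp/neo_partitioned_by_month")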


5. Understanding Bucketing

Bucketing divides data into a fixed number of buckets using a hash function on a column.

  • Unlike partitioning, bucketed data lives in a fixed number of files within the same folder
  • Useful for joins, aggregations, and sampling
  • When both sides of a join are bucketed on the join key with the same bucket count, Spark can co-locate matching rows and avoid a shuffle (see the sketch below)
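
To build intuition for how rows are assigned to buckets, the illustrative snippet below approximates the assignment with Spark SQL's hash and pmod functions; the real bucket id is computed internally, but it follows the same hash-then-modulo pattern.

from pyspark.sql import functions as F

# Illustration only: estimate which of 4 buckets each category would land in
df.withColumn(
    "bucket_id",
    F.expr("pmod(hash(category), 4)")  # Murmur3 hash of the column, modulo the bucket count
).show()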

6. Writing Bucketed Tables

df.write.mode("overwrite") \
    .bucketBy(4, "category") \
    .sortBy("revenue") \
    .saveAsTable("bucketed_sales")

  • 4 buckets on category
  • Within each bucket, data is sorted by revenue
  • Note: bucketBy() only works with saveAsTable(), not with plain file writes like parquet(path)
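
To check that the table really is bucketed, inspect its metadata; this sketch assumes the bucketed_sales table created above.

# The detailed description should include rows such as
# "Num Buckets", "Bucket Columns", and "Sort Columns"
spark.sql("DESCRIBE EXTENDED bucketed_sales").show(truncate=False)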

7. Reading & Joining Bucketed Tables

bucketed_df = spark.table("bucketed_sales")

# Self-join on the bucket column as an illustration; in practice the other side
# would be a second table bucketed on the same key with the same bucket count
other_df = spark.table("bucketed_sales")
joined_df = bucketed_df.join(other_df, "category")

  • When both sides are bucketed on the join key, Spark can skip the full shuffle (see the check below)
  • Great for large fact-dimension joins
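
To verify this, look at the physical plan. A minimal check, assuming bucketing is enabled via spark.sql.sources.bucketing.enabled (it is by default): the bucketed sides of the join should not require an Exchange (shuffle) step.

# Bucketing must be enabled for Spark to exploit the bucket layout (default: true)
spark.conf.set("spark.sql.sources.bucketing.enabled", "true")

# With matching bucket columns and counts, the join plan should show
# no Exchange before the sort-merge join
joined_df.explain()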

8. Partitioning vs Bucketing — Quick Comparison

Feature            Partitioning                Bucketing
Physical layout    Subdirectories              Fixed number of files within a directory
Purpose            Query pruning               Optimized joins/aggregations
Column type        Low/moderate cardinality    High cardinality (e.g. join keys)
Main benefit       Reduces I/O                 Reduces shuffle during joins
Sorting            Optional                    Can sort within each bucket (sortBy)

9. Best Practices

✔ Partition by date for time-series data
✔ Bucket large dimension tables by the join key
✔ Combine partitioning + bucketing for large fact tables (see the sketch below)
✔ Use sortBy() within buckets for faster aggregation
✔ Monitor the number of files to avoid the small-file problem
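
A minimal sketch of the combination, assuming a hypothetical large fact DataFrame orders_fact that already carries year, month, and customer_id columns:

# Partitioning prunes by time, bucketing co-locates rows by the join key,
# and sortBy() speeds up per-bucket aggregations
orders_fact.write.mode("overwrite") \
    .partitionBy("year", "month") \
    .bucketBy(8, "customer_id") \
    .sortBy("customer_id") \
    .saveAsTable("orders_partitioned_bucketed")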


10. Story Example

NeoMart stores billions of orders.

  • Partition by year and month → only read relevant months
  • Bucket by customer_id → join with customer master table without shuffle
  • Result → queries 5x faster, cluster utilization optimized

Summary

With partitioning and bucketing, you can:

  • Reduce I/O with partition pruning
  • Minimize shuffle in large joins
  • Sort data within buckets for fast aggregation
  • Build scalable, production-ready pipelines

Mastering these techniques is what keeps PySpark efficient at scale, just as NeoMart stays efficient while handling billions of transactions every day.


Next Topic → Caching, Persisting, and Memory Management