
Performance Tuning — Partitions, Repartition, and Coalesce in PySpark

At NeoMart, large-scale pipelines often process billions of rows.
Efficient partitioning is critical to:

  • Reduce shuffle and disk I/O
  • Improve join and aggregation performance
  • Balance task distribution across executors

This chapter explains partitions, repartition, and coalesce with practical examples.


1. Understanding Partitions

  • Spark divides data into partitions → processed in parallel
  • Number of partitions affects parallelism and performance
  • Default partition count depends on cluster configuration and source
print(df.rdd.getNumPartitions())
  • Returns the number of partitions for df

2. Repartition — Increase or Shuffle Partitions

  • repartition(n) → reshuffles data into n partitions
  • Useful for parallelizing wide transformations or joins
# Repartition to 8 partitions
df_repart = df.repartition(8)
print(df_repart.rdd.getNumPartitions())

Story

NeoMart joins two large datasets.

  • Original partitions = 2 → only two tasks run at once, leaving most executors idle
  • Repartition to 8 → work spreads evenly across eight tasks → faster execution

3. Coalesce — Reduce Partitions Without Shuffle

  • coalesce(n) → reduces partitions without full shuffle
  • Ideal after filtering or aggregation
# Reduce partitions to 2 without shuffle
df_coalesce = df_repart.coalesce(2)
print(df_coalesce.rdd.getNumPartitions())
  • Avoids expensive shuffle operation → faster than repartition when reducing partitions

4. Practical Example

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("NeoMart").getOrCreate()

data = [(i, f"product_{i}", i*10) for i in range(1, 101)]
df = spark.createDataFrame(data, ["id", "name", "price"])

# Check initial partitions
print("Initial partitions:", df.rdd.getNumPartitions())

# Increase partitions for join
df_repart = df.repartition(10)
print("After repartition:", df_repart.rdd.getNumPartitions())

# Filter down to expensive products
df_filtered = df_repart.filter(F.col("price") > 500)

# Reduce partitions after filtering
df_final = df_filtered.coalesce(3)
print("After coalesce:", df_final.rdd.getNumPartitions())

Output

Initial partitions: 4
After repartition: 10
After coalesce: 3
  • Actual counts vary: the initial number of partitions depends on cluster configuration and default parallelism
  • Shuffles only occur in repartition, not in coalesce
  • Balanced partitions → better task parallelism

5. Best Practices

✔ Repartition before wide transformations or joins → balance tasks
✔ Coalesce after filtering or aggregations → avoid small files and shuffle
✔ Avoid too many partitions → overhead of task scheduling
✔ Avoid too few partitions → underutilized cluster resources
✔ Combine with partitioning on columns for large datasets


6. Story Example

NeoMart runs nightly ETL on 500 million rows:

  • Original partitions = 50 → join tasks uneven
  • Repartition to 200 → full cluster utilization → faster joins
  • After filtering high-value products → coalesce to 50 → fewer small files
  • Result → 3x faster ETL runtime

Summary

  • Partitions → control parallelism
  • repartition(n) → increase partitions with shuffle
  • coalesce(n) → decrease partitions without shuffle
  • Proper tuning → faster joins, reduced shuffle, optimized cluster utilization

NeoMart pipelines achieve high throughput and low latency by carefully repartitioning and coalescing DataFrames.


Next Topic → Introduction to Structured Streaming.