RDD Persistence & Caching — Memory Management in Spark

NeoMart’s data team is running an advanced analytics pipeline on customer clickstream logs. The process includes:

  • Cleaning raw data
  • Extracting session-level metrics
  • Running machine learning transformations
  • Aggregating results for dashboards

Each step uses the same processed RDD multiple times.

But there is a problem:
Running the entire pipeline again and again takes too long. Spark must recompute every transformation from scratch, rebuilding lineage and rerunning all upstream stages.

Enter RDD Persistence & Caching — Spark’s way of remembering data for faster computations, saving precious time, money, and compute resources.


Why Do We Need Caching?

Spark uses lazy evaluation, meaning RDD transformations are not executed until an action triggers them.
So if an RDD is used multiple times:

cleaned_data.count()
cleaned_data.take(10)
cleaned_data.saveAsTextFile("/mnt/output")

Spark recomputes cleaned_data three separate times unless you cache it.

Caching solves this by storing the RDD in memory (or memory and disk), so repeated access reads the stored partitions instead of recomputing them.
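For example, a minimal sketch of the same three actions with caching added (reusing the cleaned_data RDD and output path from above):

cleaned_data.cache()            # mark the RDD to be kept in memory

cleaned_data.count()            # first action computes the RDD and caches its partitions
cleaned_data.take(10)           # reuses the cached partitions
cleaned_data.saveAsTextFile("/mnt/output")   # no recomputation of upstream stages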


cache() vs persist()

Spark provides two main ways to store RDDs:

### 1. cache()

Stores RDD in memory only.

rdd.cache()

Equivalent to:

rdd.persist(StorageLevel.MEMORY_ONLY)
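If you want to verify this equivalence yourself, RDDs expose getStorageLevel(); a quick check on an already-created rdd might look like:

rdd.cache()
print(rdd.getStorageLevel())  # reports the level set by cache() (MEMORY_ONLY for RDDs)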

### 2. persist()

Allows specifying different storage levels.

from pyspark import StorageLevel

rdd.persist(StorageLevel.MEMORY_AND_DISK)

Used when data may not fit entirely in memory.


Available Storage Levels

| Storage Level | Description |
|---|---|
| MEMORY_ONLY | Fastest; partitions that don't fit in memory are recomputed when needed |
| MEMORY_AND_DISK | Stores what fits in memory; spills the rest to disk |
| DISK_ONLY | Slower, but ensures full persistence without memory pressure |
| MEMORY_ONLY_SER | Serialized in memory — reduces size but increases CPU cost |
| MEMORY_AND_DISK_SER | Serialized in memory with disk spill — balances size and reliability |
| OFF_HEAP | Stores data in off-heap memory (Tungsten); rare in typical workloads |

Note: in PySpark, RDD data is always stored serialized (pickled), so the _SER variants are mainly relevant to the Java/Scala API.
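As a rough rule of thumb (a sketch with hypothetical RDD names, not a hard rule):

from pyspark import StorageLevel

small_rdd.persist(StorageLevel.MEMORY_ONLY)      # comfortably fits in executor memory
large_rdd.persist(StorageLevel.MEMORY_AND_DISK)  # may not fit entirely; spill the rest to disk
huge_rdd.persist(StorageLevel.DISK_ONLY)         # far larger than memory; keep it on disk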

Story Example: NeoMart Recommendation Pipeline

NeoMart runs a sessionization workflow to build personalized recommendations.

Without caching:

  • Each model training iteration recomputes raw logs
  • Session extraction runs again
  • Feature engineering runs again
  • Total time: 45 minutes

With caching:

sessions = (
    logs
    .filter(lambda x: "session" in x)
    .map(parse_session)
)

sessions.cache()

model = train_recommendation_model(sessions)

Total time drops to 8 minutes.

Caching saved them over 80% of the compute time.


When Should You Cache an RDD?

When the RDD is reused multiple times

Example: Training multiple ML models with the same preprocessed data.

When recomputation cost is expensive

Example: Custom parsing, UDFs, joins, or external IO.

When performing iterative algorithms (see the sketch after this list)

  • PageRank
  • K-Means
  • Gradient descent loops
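As a sketch of this pattern (not NeoMart's actual pipeline), here is a small gradient-descent-style loop where the training RDD is cached once and reused on every iteration; the path and the (x, y) parsing are hypothetical:

from pyspark import StorageLevel

# Hypothetical training data: one "x,y" pair per line
points = (
    sc.textFile("/mnt/neomart/training_points")
    .map(lambda line: tuple(float(v) for v in line.split(",")))
)
points.persist(StorageLevel.MEMORY_AND_DISK)

w = 0.0       # single weight for a toy 1-D linear model
lr = 0.001    # learning rate

for i in range(20):
    # One action per iteration; the cached points are reused instead of re-reading the file
    gradient = points.map(lambda p: (w * p[0] - p[1]) * p[0]).mean()
    w -= lr * gradient

points.unpersist()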

When running multiple actions on the same RDD

Such as count(), take(), collect(), saveAsTextFile().


When Not to Cache

❌ RDD is used only once

Caching wastes memory.

❌ RDD is too large to fit in memory

Prefer MEMORY_AND_DISK or avoid caching.

❌ Using DataFrames instead

Catalyst & Tungsten optimize DataFrame execution automatically, so explicit RDD-level caching is usually unnecessary there.


How to Uncache / Remove from Memory

Memory is limited. After you're done, always clean up:

rdd.unpersist()

Or clear everything cached through the DataFrame/SQL layer (cached tables and DataFrames):

spark.catalog.clearCache()

Debugging: How to See Cached RDDs

In Databricks or Spark UI:

  • Open the Storage tab
  • View size, partitions, and storage level
  • Monitor memory usage
  • Identify partitions not cached due to size

This helps optimize cluster resources effectively.
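Alongside the UI, caching status is also visible from code. A quick check on the sessions RDD from the earlier example might look like this (is_cached and getStorageLevel are standard RDD members; uiWebUrl is a SparkContext attribute):

print(sessions.is_cached)          # True once cache()/persist() has been called
print(sessions.getStorageLevel())  # storage level currently assigned to the RDD
print(sc.uiWebUrl)                 # base URL of the Spark UI, where the Storage tab lives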


Example: Full Pipeline Using cache() and persist()

from pyspark import StorageLevel

logs = sc.textFile("/mnt/neomart/raw_logs")

clean = (
    logs
    .filter(lambda x: "event" in x)
    .map(parse_event)
)

clean.persist(StorageLevel.MEMORY_AND_DISK)

# Perform multiple actions without recomputation
print(clean.count())
print(clean.take(5))

daily_stats = (
    clean
    .map(lambda x: (x.date, 1))
    .reduceByKey(lambda x, y: x + y)
)
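Note that daily_stats above is only defined; nothing runs until an action is called. A possible continuation and cleanup, staying with the same hypothetical pipeline, might be:

# Trigger the aggregation; `clean` is served from the persisted copy, not recomputed
for event_date, event_count in daily_stats.collect():
    print(event_date, event_count)

# Release the persisted partitions once the job is finished
clean.unpersist()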

Best Practices for RDD Caching

🔹 Cache early in iterative algorithms

Avoid repeating expensive transformations.

🔹 Use MEMORY_ONLY when data fits

Fastest option.

🔹 Use MEMORY_AND_DISK when unsure

Safe and reliable.

🔹 Don’t cache everything

Be selective to avoid memory pressure.

🔹 Clean up with unpersist()

Especially in long Databricks jobs.


Summary — Caching Makes Spark Lightning Fast

  • RDD caching prevents expensive recomputations.
  • cache() stores data in memory; persist() lets you choose storage levels.
  • Useful for ML loops, repeated actions, and expensive pipelines.
  • Improves performance and reduces cluster cost.
  • Spark UI helps monitor cached datasets and memory usage.

Caching is one of the most powerful performance tools in Spark — when used wisely, it turns slow pipelines into near real-time workflows.


Next, we’ll cover Creating DataFrames from CSV, JSON, Parquet, and Hive Tables.