Caching in Databricks — Best Practices
✨ Story Time — “My Query Is Fast… But Only the Second Time?”
Mila, a data analyst, runs a heavy analytical query:
SELECT region, SUM(total_sales)
FROM transactions
GROUP BY region;
The first time: 30 seconds. The second time: 4 seconds.
She wonders: “Why is it so much faster now?”
Her teammate smiles:
“That’s Databricks caching. Used right, it can speed up your entire Lakehouse.”
Let’s understand why caching can be your secret superpower.
🧩 What Is Caching in Databricks?
Caching means storing frequently accessed data in memory or local SSDs so queries run MUCH faster.
Databricks supports three types of caching:
- Delta Cache (Disk-based cache managed by Databricks)
- Spark Memory Cache (Stored in RAM via CACHE TABLE)
- Databricks SQL Warehouse Cache (Disk-based cache for SQL Warehouses)
Each one accelerates repeated queries by avoiding costly reads from cloud storage (S3/ADLS/GCS).
🔍 Type 1: Delta Cache (Databricks Runtime)
Activated automatically when reading from Delta tables.
How it works:
- Cached on each worker node
- Stored on local SSDs, not RAM
- Persistent during the cluster lifetime
- Great for BI dashboards & repeated table scans
Enable (if disabled):
SET spark.databricks.io.cache.enabled = true;
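You can check or flip the same flag from a notebook. A minimal PySpark sketch, using the spark session that Databricks notebooks already provide:
# Check whether the Delta cache is enabled for this session
print(spark.conf.get("spark.databricks.io.cache.enabled"))
# Enable it (equivalent to the SET statement above)
spark.conf.set("spark.databricks.io.cache.enabled", "true")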
🔍 Type 2: Spark Memory Cache
Manually cache a table into memory:
CACHE TABLE transactions;
Note: the similar-looking CACHE SELECT * FROM transactions; warms the Delta disk cache, not Spark's memory cache.
Best for:
- Heavy compute transformations
- Machine learning workloads
- Repeated DataFrame operations in notebooks
Limitations:
- Uses cluster RAM
- Cache is lost if cluster goes down
- Not ideal for very large tables
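The same idea works in a notebook via the DataFrame API. A minimal sketch, assuming the transactions Delta table from earlier has region and total_sales columns:
from pyspark.sql import functions as F

# .cache() is lazy: nothing is stored until an action runs
df = spark.table("transactions").cache()
df.count()  # first action materializes the cache in cluster RAM

# Repeated operations now read from memory instead of cloud storage
df.groupBy("region").agg(F.sum("total_sales").alias("total_sales")).show()

# Release the memory once the iterative work is done
df.unpersist()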
🔍 Type 3: Databricks SQL Warehouse Cache
Used by BI tools and dashboards.
Benefits:
- Fast interactive queries
- Stores results & metadata
- Extremely efficient for frequently refreshed dashboards
Works automatically behind the scenes — no setup required.
⚡ When Should You Use Caching?
✅ Use caching when:
✔ Running the same query multiple times
✔ Interactive analysis (SQL editor, notebooks, BI tools)
✔ Data fits into memory or SSD
✔ You want ultra-fast dashboard loads
✔ You run iterative ML transformations
✔ You have hot tables read frequently
❌ When NOT to Use Caching
Avoid caching when:
✖ Your table updates very frequently (cache invalidation overhead; see the sketch after this list)
✖ Data is too large to fit in RAM or SSD
✖ You are running one-time ETL jobs
✖ Queries are always unique (no repetition)
✖ You are using Job Clusters (cache resets every job)
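If you must cache a table that occasionally changes, invalidate stale entries explicitly. A short sketch using standard Spark SQL commands from a notebook:
# Re-read metadata and invalidate stale cached data for one table
spark.sql("REFRESH TABLE transactions")

# Remove one table from the memory cache
spark.sql("UNCACHE TABLE IF EXISTS transactions")

# Remove every entry from the memory cache
spark.sql("CLEAR CACHE")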
🧪 Real-World Example — 8× Faster Dashboard
Mila’s team has a Power BI dashboard querying a Delta table every 10 minutes.
Before caching:
- 22-second load time
- Frequent cluster spikes
- Occasional timeouts
After caching the table on the cluster:
CACHE TABLE sales_aggregated;
Results:
- Dashboard loads in 3 seconds
- Cluster CPU dropped by 45%
- BI team finally stopped complaining 🎉
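A warm-up cell scheduled right after cluster start keeps wins like this consistent. A sketch, where the table and column names are the hypothetical ones from this example:
# Eagerly scan and cache the table (CACHE TABLE materializes immediately)
spark.sql("CACHE TABLE sales_aggregated")

# Run the dashboard's hot query once so the first BI request is already warm
spark.sql("""
    SELECT region, SUM(total_sales) AS total_sales
    FROM sales_aggregated
    GROUP BY region
""").collect()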
🔧 How to Check If Cache Is Being Used
For Delta Cache, confirm it is enabled for the session:
spark.conf.get("spark.databricks.io.cache.enabled")
Per-node disk-cache usage also shows up in the Storage tab of the Spark UI.
For Memory Cache, ask the catalog whether a table is cached:
spark.catalog.isCached("transactions")
and drop all memory-cache entries when you are done:
CLEAR CACHE;
🎯 Best Practices for Caching
🟩 1. Cache only hot datasets
Avoid caching huge cold data.
🟩 2. Use Delta Cache for most SQL workloads
Lightweight and automatic.
🟩 3. Use MEMORY cache only for DataFrame-heavy notebooks
Not for general SQL.
🟩 4. Don’t over-cache
Caching useless data = wasted resources.
🟩 5. Combine caching with Z-ORDER + OPTIMIZE
They complement each other for performance.
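For instance, compact and co-locate the data first, then warm the cache; the Z-ORDER column here is illustrative:
# Compact small files and cluster rows by a hot filter column
spark.sql("OPTIMIZE transactions ZORDER BY (region)")

# Then warm the Delta disk cache over the freshly optimized files
spark.sql("CACHE SELECT * FROM transactions")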
🟩 6. Tune cache size for large clusters
Use SSD-heavy clusters for maximum Delta Cache performance.
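If you need to cap or grow the cache, the disk-cache settings below go into the cluster's Spark config at startup (they cannot be changed at runtime); the values are illustrative:
spark.databricks.io.cache.maxDiskUsage 50g
spark.databricks.io.cache.maxMetaDataCache 1g
spark.databricks.io.cache.compression.enabled false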
📘 Summary
- Caching improves query speed dramatically by storing frequently accessed data in memory or SSD.
- Databricks provides three caching layers: Delta Cache, Spark Memory Cache, and SQL Warehouse Cache.
- Best used for repeated queries, dashboards, ML workflows, and hot datasets.
- Avoid caching massive or frequently updated data.
- Use caching alongside OPTIMIZE, Z-ORDER, and file compaction for maximum Lakehouse performance.
Caching = Fast queries, low cost, happy analysts.
👉 Next Topic
Photon Execution Engine — When & Why to Use