File Compaction & Delta File Management
✨ Story Time — “Why Are My Queries Slowing Down Every Week?”
Meet Arjun, a data engineer responsible for maintaining a busy Delta Lake table receiving:
- CDC updates every 5 minutes
- Batch data every hour
- Streaming inserts all day
At first, everything is fast.
But after a few weeks:
- Queries slow down
- Dashboards lag
- Costs increase
- Data engineers keep asking: “Why is Delta so slow now?”
Arjun opens the Delta table storage…
He sees THOUSANDS of tiny files — the dreaded Small File Problem.
He smiles.
He knows exactly what’s needed:
➡ File Compaction & Proper Delta File Management.
🧩 What Is File Compaction in Delta Lake?
File compaction is the process of merging many small Delta files into fewer, larger, optimized files.
Why small files happen:
- Streaming writes produce small batches
- Frequent micro-batch ingest
- CDC jobs write small delta chunks
- Over-partitioning causes tiny files per partition
Small files = slow queries + high compute cost + too much metadata.
Compaction solves this by:
- Reducing file count
- Increasing file size
- Improving read performance
- Reducing metadata overhead
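Before doing anything else, you can check whether a table actually suffers from small files. A minimal check using DESCRIBE DETAIL (my_delta_table is a placeholder name):
-- Table-level file statistics: the output includes numFiles and sizeInBytes,
-- so sizeInBytes / numFiles gives the average file size.
DESCRIBE DETAIL my_delta_table;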
🔍 Why Small Files Hurt Performance
❌ More files = More metadata
Each query has to read metadata for every file → slower planning.
❌ More files = More unnecessary reads
Even if only 1 row matches the filter, Databricks still must scan many files.
❌ More files = Higher storage cost
Many tiny files create version bloat.
❌ More files = Slower Z-ORDER & OPTIMIZE
The more files you have, the heavier maintenance operations become.
Solution → Compaction through OPTIMIZE.
⚙️ How Delta Performs File Compaction
The key command:
OPTIMIZE my_delta_table;
What it does:
- Scans small files
- Groups and merges them
- Writes larger Parquet files (typically hundreds of MB up to ~1GB, depending on the table's target file size)
- Updates Delta transaction log
- Marks the old small files as removed in the transaction log (they are physically deleted later by VACUUM)
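On large tables you rarely need to compact everything at once. OPTIMIZE also accepts a predicate on partition columns, so you can target only the newest data. A sketch, assuming the table is partitioned by a hypothetical event_date column:
-- Compact only the last week of partitions; the command returns metrics
-- such as numFilesAdded and numFilesRemoved for the affected partitions.
OPTIMIZE my_delta_table
WHERE event_date >= current_date() - INTERVAL 7 DAYS;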
🔁 Automatic File Compaction (With Auto-Optimize)
Databricks also offers automated compaction:
ALTER TABLE my_delta_table
SET TBLPROPERTIES (
'delta.autoOptimize.optimizeWrite' = 'true',
'delta.autoOptimize.autoCompact' = 'true'
);
What these do:
| Property | Action |
|---|---|
| optimizeWrite | Writes fewer, larger files during ingest |
| autoCompact | Merges files after small batch inserts |
Perfect for streaming or frequent batches.
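The table properties above apply per table. If you would rather switch the same behavior on for an entire session, for example in a streaming notebook, the corresponding Spark settings can be set from SQL as well. A sketch, assuming a Databricks runtime that exposes these configs:
-- Session-level equivalents of optimizeWrite / autoCompact
SET spark.databricks.delta.optimizeWrite.enabled = true;
SET spark.databricks.delta.autoCompact.enabled = true;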
🧪 Real-World Example — Before & After Compaction
Arjun’s table (before):
- 8,200 files per partition
- Avg file size: 40KB
- Query runtime: 34 seconds
After:
OPTIMIZE sales_data ZORDER BY (customer_id);
VACUUM sales_data RETAIN 168 HOURS;
- 320 files per partition
- Avg file size: 300MB
- Query runtime: 5 seconds
Improved performance, reduced cost, and less pressure on the cluster.
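You can verify what the maintenance run actually did straight from the transaction log. A quick check with DESCRIBE HISTORY:
-- The OPTIMIZE entry carries operationMetrics such as numAddedFiles and numRemovedFiles
DESCRIBE HISTORY sales_data LIMIT 5;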
📦 Delta File Management — The Full Picture
Delta Lake automatically manages:
- Transaction logs (_delta_log/)
- Versioning
- Compaction
- Data skipping
- File pruning
- Data removal with VACUUM
But you must manage:
- When to compact
- How often to vacuum
- How to structure partitions
- How to avoid unnecessary file explosion
🎯 Best Practices for File Compaction
✅ 1. Compact high-ingestion tables regularly
Daily or weekly, depending on volume.
✅ 2. Enable Auto-Optimize for streaming workloads
Reduces small files during writes.
✅ 3. Combine OPTIMIZE with Z-ORDER
Boosts data skipping for faster queries.
✅ 4. Avoid over-partitioning
Too many partitions → too many tiny files.
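As an illustration, partitioning by a high-cardinality key guarantees tiny files in every partition, while a low-cardinality date column keeps partitions large enough to compact well. A sketch with hypothetical table and column names:
-- Anti-pattern: PARTITIONED BY (customer_id) creates one folder of tiny files per customer.
-- Better: partition by a low-cardinality column such as the order date.
CREATE TABLE orders (
  order_id BIGINT,
  customer_id BIGINT,
  order_date DATE,
  amount DECIMAL(10, 2)
) USING DELTA
PARTITIONED BY (order_date);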
✅ 5. Use VACUUM after compaction
Clean old files and free storage:
VACUUM my_delta_table RETAIN 168 HOURS;
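If you want to see what would be deleted before committing, VACUUM supports a DRY RUN option:
-- List the files that would be removed, without deleting anything
VACUUM my_delta_table RETAIN 168 HOURS DRY RUN;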
✅ 6. Monitor file count
If files per partition > 1000 → compaction required.
📘 Summary
- File compaction merges small files into large, efficient ones.
- Small files slow down queries, inflate compute cost, and bloat metadata.
- OPTIMIZE + Auto-Optimize are the main tools for managing Delta Lake storage.
- Use VACUUM to clear old files after compaction.
- Proper file management makes your Lakehouse fast, clean, and cost-efficient.
👉 Next Topic
Caching in Databricks — Best Practices