Improving Lakehouse Performance — Dos & Don’ts

✨ Story Time — “Why Is the Lakehouse Getting Slower?”

Ethan is the lead data engineer at a fast-scaling startup.

What began as a simple Lakehouse with 10 tables has now grown to:

  • 200+ tables
  • Multiple ETL pipelines
  • Streaming + batch mixed workloads
  • Dashboards hitting data every few minutes

Suddenly:

  • Queries slow down
  • Jobs take longer
  • Costs spike
  • The CTO asks:
    “Is our Lakehouse breaking?”

Ethan takes a deep breath.
The Lakehouse is fine — it just needs proper performance tuning.

Let’s learn the right and wrong ways to do that.


🟩 DO: Compact Files & Run OPTIMIZE Regularly

Small files = slow reads + high compute cost.

OPTIMIZE table_name;
OPTIMIZE table_name ZORDER BY (important_column);

Use Cases:

  • High-ingestion tables
  • Streaming outputs
  • CDC pipelines

Result: Faster scans, better data skipping, lower cost.
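
To verify that compaction worked, check the table's file count before and after (this assumes a Delta table; numFiles should drop sharply):

-- Returns one row including numFiles and sizeInBytes
DESCRIBE DETAIL table_name;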


🟥 DON'T: Let Tiny Files Accumulate

If your table has:

  • Thousands of small files
  • Files of only 10–100 KB
  • Heavy streaming writes
  • Over-partitioning

… your performance will crash.

Symptoms:

  • Slow queries
  • High I/O
  • Excessive metadata operations

Prevent file explosion with Auto-Optimize:

ALTER TABLE my_table
SET TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true',
  'delta.autoOptimize.autoCompact' = 'true'
);
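
You can also set these properties up front when creating a table (a minimal sketch; table and column names are illustrative):

CREATE TABLE my_table (id BIGINT, event_date DATE)
USING DELTA
TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true',
  'delta.autoOptimize.autoCompact' = 'true'
);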

🟩 DO: Use Z-ORDER for Filter Columns

Z-ORDER physically clusters related values.

Best for queries filtering by:

  • customer_id
  • event_date
  • device_id
  • sku or product_id

Example:

OPTIMIZE events ZORDER BY (device_id, event_timestamp);

Result: filter-heavy queries often run 5×–20× faster, depending on how selective the filters are.
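
For example, a query like this (hypothetical filter values; reading is an illustrative column) can now skip most files:

-- Data skipping prunes files whose device_id / event_timestamp ranges don't match
SELECT device_id, event_timestamp, reading
FROM events
WHERE device_id = 'sensor-042'
  AND event_timestamp >= '2024-01-01';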


🟥 DON'T: Z-ORDER Too Many Columns

Z-ORDER works best with 1–3 columns.

Problems with too many columns:

✖ Slower OPTIMIZE
✖ Too much data movement
✖ No performance gain

Keep it simple → only Z-ORDER your primary filter keys.


🟩 DO: Use Photon for SQL Queries

Photon = Databricks’ native, vectorized execution engine, written in C++ for speed.

Perfect for:

  • BI dashboards
  • Large SQL aggregations
  • Ad-hoc analytics
  • Interactive data exploration

Enable Photon in SQL Warehouses & All-purpose clusters.


🟥 DON'T: Expect Photon to Speed Up Everything

Photon doesn’t accelerate:

✖ Python UDFs
✖ R or Scala-heavy workloads
✖ ML pipelines
✖ Non-SQL transformations

Use Photon only where it makes sense.


🟩 DO: Partition for Write Volume, Not Read Volume

Partitioning pays off for high-volume ingestion workloads, where each partition keeps receiving substantial amounts of data.

Examples:

PARTITIONED BY (event_date)
PARTITIONED BY (region)
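
A minimal sketch of a date-partitioned ingestion table (table and column names are illustrative):

CREATE TABLE raw_events (
  event_id BIGINT,
  region STRING,
  payload STRING,
  event_date DATE
)
USING DELTA
PARTITIONED BY (event_date);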


🟥 DON'T: Over-Partition

Too many partitions → tiny files → slow queries.

Bad examples:

✖ Partitioning by customer_id
✖ Partitioning by a column with 1000+ distinct values

Rule:

If a partition would hold less than 256 MB of data → don’t partition by that column.
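
A quick sanity check before committing to a partition column (assuming a Delta table): divide total table size by the column’s distinct-value count.

-- sizeInBytes gives the total table size
DESCRIBE DETAIL my_table;

-- How many partitions would this column create?
-- If sizeInBytes / distinct count < 256 MB, don't partition by region
SELECT COUNT(DISTINCT region) AS partition_count FROM my_table;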


🟩 DO: Cache Hot Tables

Cache tables used repeatedly:

CACHE TABLE daily_sales;

Great for:

  • BI dashboards
  • Repeated queries
  • Interactive explorations
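
When a cached table stops being hot, release the memory explicitly:

UNCACHE TABLE daily_sales;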

🟥 DON'T: Cache Large, Rarely Used Tables

Caching is not storage — it’s memory.

Avoid caching:

✖ 5 TB tables
✖ Cold data rarely queried
✖ Tables that update frequently


🟩 DO: Tune SQL Endpoints Properly

For BI dashboards:

✔ Use Pro or Serverless SQL Warehouses
✔ Enable autoscaling
✔ Tune concurrency
✔ Allow the warehouse to scale up during peak hours


🟥 DON'T: Run Dashboards on Job Clusters

Job clusters are:

  • Temporary
  • Not cached
  • Not optimized for BI
  • Slow for dashboards

Use SQL Warehouses instead.


🟩 DO: Reduce Shuffle & Skew

Data skew leads to:

  • Long-running tasks
  • Failed jobs
  • High compute cost

Fix skew by:

✔ Salting keys (see the sketch below)
✔ Using REPARTITION
✔ Broadcasting small tables
✔ Reducing unnecessary joins

Example:

-- Broadcasts dim_table to every executor, avoiding a shuffle (names are illustrative)
SELECT /*+ BROADCAST(dim_table) */ f.order_id, d.dim_name
FROM fact_table f
JOIN dim_table d ON f.dim_id = d.id;
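
When the dimension table is too large to broadcast, salting spreads a skewed key across several buckets. A minimal sketch (the bucket count, table names, and column names are all illustrative):

-- Salt the skewed fact keys into 8 buckets, replicate the dimension rows
-- across the same buckets, then join on (key, salt)
WITH salted_fact AS (
  SELECT f.*, CAST(FLOOR(RAND() * 8) AS INT) AS salt
  FROM fact_table f
),
salted_dim AS (
  SELECT d.*, s.salt
  FROM dim_table d
  CROSS JOIN (SELECT EXPLODE(SEQUENCE(0, 7)) AS salt) s
)
SELECT sf.order_id, sd.dim_name
FROM salted_fact sf
JOIN salted_dim sd
  ON sf.dim_id = sd.id
 AND sf.salt = sd.salt;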

🟩 DO: Monitor & Profile Queries

Databricks gives you powerful tools:

  • Query Profile
  • Query History
  • Spark UI
  • SQL Query Insights

Track:

  • Shuffle volume
  • Scan volume
  • Stage execution time
  • Skewed partitions
  • Photon usage
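
For a quick look at a plan without leaving SQL, EXPLAIN shows the scans, joins, and exchanges (shuffles) a query will perform before you run it (the query below is illustrative):

EXPLAIN FORMATTED
SELECT device_id, COUNT(*) AS event_count
FROM events
WHERE event_timestamp >= '2024-01-01'
GROUP BY device_id;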

📘 Summary

Improving Lakehouse performance is a balance of:

✔ Storage optimization

  • Compact files
  • Z-ORDER
  • OPTIMIZE
  • Correct partitioning

✔ Compute optimization

  • Photon
  • Correct cluster sizing
  • Autoscaling

✔ Query optimization

  • Avoid SELECT *
  • Filter early
  • Broadcast joins
  • Tune SQL endpoints

✔ Architecture alignment

  • Use SQL Warehouses for BI
  • Avoid over-caching
  • Prevent small file problems

A well-tuned Lakehouse is fast, cost-efficient, and extremely scalable.


👉 Next Topic

Unity Catalog — Central Governance Explained