
Cluster Sizing — Choosing the Right Instance Type

✨ Story Time — “Why Is This Pipeline So Expensive?”

Sara is a data engineer managing multiple ETL pipelines:

  • Some jobs run slowly
  • Some jobs fail intermittently
  • Some cost too much
  • Analysts complain that dashboards get stuck

The CTO walks by:

“Sara, our cloud bill looks… scary.
Can we optimize our clusters?”

Sara nods.
Cluster sizing isn’t just about performance.
It’s about speed + stability + cost-efficiency, all working together.

And Databricks gives you dozens of instance types…
Which one is the right choice?

Let’s simplify this.


🧩 What Is Cluster Sizing?

Cluster sizing is the process of choosing:

  • Node type (compute-optimized, memory-optimized, GPU, etc.)
  • Number of workers
  • Driver size
  • Autoscaling configuration
  • Spot vs On-demand nodes

Your choices directly impact:

  • Cost
  • Performance
  • Stability
  • Job success rate

Choosing the wrong cluster = Slow + Expensive.
Choosing the right cluster = Fast + Cheap.
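
To make these knobs concrete, here is a minimal sketch of where each one lives in a cluster spec sent to the Databricks Clusters API 2.0. The workspace URL, token, runtime version, and instance types are placeholders, not recommendations:

```python
import requests

# Minimal cluster spec showing where each sizing knob lives
# (Databricks Clusters API 2.0, AWS field names).
cluster_spec = {
    "cluster_name": "etl-right-sized",
    "spark_version": "14.3.x-scala2.12",      # placeholder runtime version
    "node_type_id": "m5.xlarge",              # worker node type
    "driver_node_type_id": "m5.xlarge",       # driver size
    "autoscale": {"min_workers": 2, "max_workers": 10},  # autoscaling config
    "aws_attributes": {
        "availability": "SPOT_WITH_FALLBACK", # spot, with on-demand fallback
        "first_on_demand": 1,                 # keep the driver on-demand
    },
}

# Placeholder workspace URL and token -- substitute your own.
resp = requests.post(
    "https://<your-workspace>.cloud.databricks.com/api/2.0/clusters/create",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json=cluster_spec,
)
print(resp.json())  # returns {"cluster_id": "..."} on success
```

Every sizing decision in this article maps onto one of these fields.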


🏗️ Types of Databricks Cluster Nodes

1. General Purpose (Balanced)

A safe default when you’re not sure what you need.

Great for:

  • Medium ETL jobs
  • Not-too-heavy SQL queries
  • Mixed workloads

Examples:

  • m5.xlarge
  • m5.2xlarge

2. Compute-Optimized

High CPU power — great for parallel workloads.

Best for:

✔ Photon workloads
✔ SQL-heavy jobs
✔ Aggregations & group-bys
✔ BI dashboards

Examples:

  • c5.xlarge
  • c5.2xlarge

3. Memory-Optimized

High RAM — great for large joins & heavy shuffle.

Best for:

✔ ETL pipelines
✔ machine learning feature joins
✔ caching large datasets

Examples:

  • r5.xlarge
  • r5.4xlarge

4. Storage-Optimized

Useful when you need fast local disk — e.g., Delta caching.

Best for:

✔ Photon
✔ Data skipping workloads
✔ Large Delta tables

Examples:

  • i3.xlarge
  • i3en.2xlarge

5. GPU Nodes

Best for ML training & deep learning, not SQL/ETL.

Examples:

  • p3.2xlarge
  • g4dn.xlarge
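
If you want these five categories in code, here is a tiny hypothetical lookup that simply encodes the example AWS instance types from this section; adjust for your own cloud, region, and workloads:

```python
# Hypothetical mapping of workload categories to the example AWS
# instance types listed above. Not official guidance -- adjust for
# your cloud, region, and data sizes.
NODE_TYPES_BY_WORKLOAD = {
    "mixed":       ["m5.xlarge", "m5.2xlarge"],    # 1. general purpose
    "sql_bi":      ["c5.xlarge", "c5.2xlarge"],    # 2. compute-optimized
    "etl":         ["r5.xlarge", "r5.4xlarge"],    # 3. memory-optimized
    "delta_heavy": ["i3.xlarge", "i3en.2xlarge"],  # 4. storage-optimized
    "ml_training": ["p3.2xlarge", "g4dn.xlarge"],  # 5. GPU
}

def suggest_node_types(workload: str) -> list[str]:
    """Return candidate node types; fall back to general purpose."""
    return NODE_TYPES_BY_WORKLOAD.get(workload, NODE_TYPES_BY_WORKLOAD["mixed"])
```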

🚀 Choosing Worker Count

A common mistake:

Choosing too many or too few workers.

General rule:

Data Volume    | Recommended Workers
< 50 GB        | 2–4 workers
50–500 GB      | 4–8 workers
500 GB – 2 TB  | 8–16 workers
2 TB+          | 16–32 workers

Always start small → scale up only if needed.
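
The table is easy to encode as a starting-point helper. This is a sketch, not a rule; real sizing also depends on shuffle and caching, as the checklist below notes:

```python
def recommended_workers(data_gb: float) -> tuple[int, int]:
    """(min, max) worker range from the rule-of-thumb table above."""
    if data_gb < 50:
        return (2, 4)
    if data_gb < 500:
        return (4, 8)
    if data_gb < 2000:
        return (8, 16)
    return (16, 32)

# Start at the low end; let autoscaling take you toward the high end.
print(recommended_workers(300))  # -> (4, 8)
```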


🔄 Autoscaling Best Practices

🟩 Enable autoscaling

It saves cost by dynamically adjusting cluster size.

🟩 Keep min nodes small

Avoid paying for idle nodes.

🟩 Keep max nodes reasonable

Prevent runaway scaling.

Example:

Min Workers: 2
Max Workers: 10

🟩 Use Enhanced Autoscaling

Better for bursty and unpredictable workloads.
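
As a concrete sketch, the Min 2 / Max 10 example above translates into a single block of the cluster spec (Clusters API field names):

```python
# Autoscaling block in a Databricks cluster spec, matching the
# Min 2 / Max 10 example above.
autoscale_config = {
    "autoscale": {
        "min_workers": 2,   # small floor -> no idle spend
        "max_workers": 10,  # sane ceiling -> no runaway scaling
    }
}
# Note: set either "autoscale" or a fixed "num_workers", not both --
# the Clusters API treats them as mutually exclusive.
```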


🧪 Real-World Example — Cost Saved by 40%

Sara’s ETL pipeline was running on:

  • 32 workers
  • r5.8xlarge (huge & expensive)
  • No autoscaling

Cost was $120/hour for a single daily job.

After right-sizing:

  • 8 workers
  • c5.2xlarge (cheaper & faster for SQL)
  • Autoscaling 4 → 12

New cost: $72/hour
Performance: 30% faster
Stability: dramatically improved

Right-sizing = $$$ saved + faster jobs.


📦 Cluster Sizing Checklist

🟩 1. What type of workload?

Workload     | Best Node Type
SQL / BI     | Compute-optimized or Photon
ETL          | General-purpose or memory-optimized
ML Training  | GPU
Delta-heavy  | Storage-optimized

🟩 2. How much data?

Size workers based on volume.

🟩 3. How much shuffling?

More shuffle = more memory needed.

🟩 4. Does caching matter?

Use i3 / i3en for fast SSD local caching.

🟩 5. Use spot instances for non-critical jobs

Spot = cheap
On-demand = reliable
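
In a cluster spec, this checklist item is one attribute block (AWS field names shown; Azure and GCP have analogous availability settings):

```python
# Spot-friendly settings for non-critical jobs (Clusters API, AWS).
spot_friendly = {
    "aws_attributes": {
        "availability": "SPOT_WITH_FALLBACK",  # use spot, fall back to on-demand
        "first_on_demand": 1,                  # driver node stays on-demand
    }
}

# Critical jobs: pay for reliability.
on_demand_only = {"aws_attributes": {"availability": "ON_DEMAND"}}
```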


🎯 Best Practices for Cluster Sizing

  • Don’t oversize — start small and scale.
  • Use Photon for SQL-intensive workloads.
  • Enable autoscaling.
  • Use spot workers for non-critical pipelines.
  • Avoid GPU nodes unless doing ML.
  • Cache hot data only when useful.
  • Consider job clusters for ETL pipelines.
  • For production SQL dashboards → use Databricks SQL Warehouses, not clusters.
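
Several of these practices combine naturally in a job (ephemeral) cluster. A sketch under the same assumptions as the earlier examples, with field names from the Databricks Jobs/Clusters APIs and illustrative instance types and runtime version:

```python
# Ephemeral job cluster for an ETL pipeline: Photon + autoscaling +
# spot workers. Attach as "new_cluster" in a Jobs API task so the
# cluster exists only for the duration of the run.
new_cluster = {
    "spark_version": "14.3.x-scala2.12",   # placeholder runtime version
    "node_type_id": "c5.2xlarge",
    "runtime_engine": "PHOTON",            # Photon for SQL-heavy stages
    "autoscale": {"min_workers": 4, "max_workers": 12},
    "aws_attributes": {
        "availability": "SPOT_WITH_FALLBACK",
        "first_on_demand": 1,              # driver stays on-demand
    },
}
```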

📘 Summary

  • Cluster sizing is essential for balancing speed, cost, and reliability.
  • Databricks offers multiple node types — choose based on workload.
  • Autoscaling and Photon can significantly improve efficiency.
  • Right-sized clusters reduce cost and increase performance.
  • Understanding your data volume and query patterns is the key to picking the right instance.

Choose smart clusters → save money → boost performance → make your team happy.


👉 Next Topic

SQL Endpoint Tuning — Query Performance Optimization