Multi-Task Job Workflows – Dependencies Across Tasks

🎬 Story Time – “One Task Fails And Everything Breaks…”

Arjun, a senior data engineer, maintains a pipeline that:

  1. Extracts data from APIs
  2. Cleans & transforms it
  3. Loads it to Delta Lake
  4. Validates quality
  5. Sends success notifications

Unfortunately, these steps were split across five separate jobs.

When the extraction job fails, the transform job still runs.
When transformation fails, the notification job still says “pipeline completed.”

Arjun sighs:

“I need something that ties everything together… with dependencies… and intelligence.”

Enter Databricks Multi-Task Job Workflows – the Lakehouse-native orchestration layer.


🔥 1. What Are Multi-Task Job Workflows?

A workflow in Databricks is a single job that contains multiple tasks with:

  • Task dependencies
  • Conditional logic
  • Modular execution
  • Shared compute clusters
  • Automatic DAG orchestration

Perfect for building end-to-end ETL pipelines in a single pane of glass.
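
Concretely, a workflow is one job definition whose tasks array encodes the whole DAG. A minimal two-task skeleton in Jobs API JSON (names are illustrative; task bodies and cluster settings are omitted here and filled in throughout the rest of this article):

{
  "name": "example_workflow",
  "tasks": [
    {"task_key": "extract"},
    {"task_key": "transform", "depends_on": [{"task_key": "extract"}]}
  ]
}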


🧱 2. Creating a Multi-Task Workflow​

Arjun opens:

Workflows → Jobs → Create Job

Then clicks “Add Task” multiple times.

Example workflow:


extract → transform → load → validate → notify

Each task can be:

  • Notebook
  • Python script
  • SQL query
  • JAR
  • Delta Live Tables pipeline
  • dbt task
  • Or a combination of these
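
Expressed through the Jobs API (2.1), the extract → transform → load chain above becomes a tasks array. A minimal sketch, assuming notebook tasks with placeholder notebook paths and omitting cluster settings:

{
  "name": "api_logs_etl",
  "tasks": [
    {
      "task_key": "extract",
      "notebook_task": {"notebook_path": "/Repos/etl/extract"}
    },
    {
      "task_key": "transform",
      "depends_on": [{"task_key": "extract"}],
      "notebook_task": {"notebook_path": "/Repos/etl/transform"}
    },
    {
      "task_key": "load",
      "depends_on": [{"task_key": "transform"}],
      "notebook_task": {"notebook_path": "/Repos/etl/load"}
    }
  ]
}

Each task_key becomes a node in the DAG, and the depends_on entries are its incoming edges.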

🔗 3. Defining Task Dependencies

Databricks uses a clean dependency UI:


[extract] → [transform] → [load]
                             ↓
                         [validate]
                             ↓
                          [notify]

A task only runs after its upstream tasks succeed.

Example:​

{
  "task_key": "transform",
  "depends_on": [{"task_key": "extract"}]
}

Dependencies can form:

  • Linear DAGs
  • Fan-in DAGs
  • Fan-out DAGs
  • Branching pipelines
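
For example, a fan-in shape where one load task waits on two extract tasks (task names here are illustrative) is just multiple depends_on entries:

{
  "task_key": "load",
  "depends_on": [
    {"task_key": "extract_orders"},
    {"task_key": "extract_customers"}
  ]
}

Fan-out is the mirror image: several downstream tasks each list the same upstream task in their depends_on.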

🧪 4. Example: Notebook-Based Multi-Task Pipeline

Step 1 β€” Extract​

# Extract: read raw JSON events and stage them as a Delta table
df_raw = spark.read.format("json").load("/mnt/raw/api_logs/")
df_raw.write.format("delta").mode("overwrite").save("/mnt/stage/logs_raw")

Step 2 β€” Transform​

# Transform: drop records without an event before writing to the clean zone
df = spark.read.format("delta").load("/mnt/stage/logs_raw")
df_clean = df.filter("event IS NOT NULL")
df_clean.write.format("delta").mode("overwrite").save("/mnt/clean/logs_clean")

Step 3 β€” Validation​

from pyspark.sql import functions as F

# Fail the task (and block downstream tasks) if any null events slipped through
df = spark.read.format("delta").load("/mnt/clean/logs_clean")
if df.filter(F.col("event").isNull()).count() > 0:
    raise Exception("Data validation failed")

Step 4 β€” Notify​

# Notify: return a success message as the task's exit value
dbutils.notebook.exit("Success: ETL Pipeline Completed")

⚙️ 5. Shared Job Cluster

Arjun selects:

  • A job cluster (cheaper than an all-purpose cluster)
  • The same cluster shared across all tasks
  • Automatic termination when the run finishes

This avoids spinning up a separate cluster for every task.
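
In the job JSON this maps to a job_clusters entry defined once and referenced by every task through job_cluster_key. A sketch with illustrative cluster values:

{
  "job_clusters": [
    {
      "job_cluster_key": "shared_etl_cluster",
      "new_cluster": {
        "spark_version": "13.3.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "num_workers": 2
      }
    }
  ],
  "tasks": [
    {
      "task_key": "extract",
      "job_cluster_key": "shared_etl_cluster",
      "notebook_task": {"notebook_path": "/Repos/etl/extract"}
    }
  ]
}

The cluster is created once at the start of the run, reused by every task that references shared_etl_cluster, and torn down when the run ends.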


🔄 6. Retry Logic Per Task

Instead of retrying the whole job, Arjun can retry only the failing task.

Task-level retry settings:

  • Retry attempts
  • Backoff
  • Timeout
  • Cluster retry vs task retry

This makes workflows much more resilient to transient failures.
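
As a sketch, these settings live on each task in the job JSON (the values below are illustrative):

{
  "task_key": "transform",
  "notebook_task": {"notebook_path": "/Repos/etl/transform"},
  "max_retries": 3,
  "min_retry_interval_millis": 60000,
  "retry_on_timeout": false,
  "timeout_seconds": 1800
}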


🧯 7. Error Handling Across Tasks​

Databricks supports:

  • ✔ Stop the entire pipeline on failure
  • ✔ Run downstream tasks only if upstream tasks succeed
  • ✔ Add a "failure notification" task as a separate branch
  • ✔ On-failure triggers for Slack/email

Example branch:

validate_failed → slack_alert
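
One way to wire that branch, assuming "Run if" conditions are available on the task: slack_alert depends on validate but only fires when the upstream fails.

{
  "task_key": "slack_alert",
  "depends_on": [{"task_key": "validate"}],
  "run_if": "AT_LEAST_ONE_FAILED",
  "notebook_task": {"notebook_path": "/Repos/etl/slack_alert"}
}

The slack_alert notebook itself can be a few lines that post to a Slack incoming webhook (the secret scope and key below are hypothetical):

import requests

# Hypothetical secret scope/key holding a Slack incoming-webhook URL
webhook_url = dbutils.secrets.get(scope="etl", key="slack_webhook")
requests.post(webhook_url, json={"text": "ETL pipeline failed at validation"})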

🌉 8. Branching Logic Inside Workflows

Arjun builds branching logic:

high_volume → process_big_data
else → process_small_data

Branches allow conditional processing depending on:

  • Input size
  • Date
  • Event type
  • External parameters

This is Databricks' version of lightweight if-else orchestration.
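
A sketch using an If/else condition task (the parameter reference and threshold are illustrative): check_volume evaluates a comparison, and each branch task depends on a specific outcome.

{
  "task_key": "check_volume",
  "condition_task": {
    "op": "GREATER_THAN",
    "left": "{{job.parameters.row_count}}",
    "right": "1000000"
  }
},
{
  "task_key": "process_big_data",
  "depends_on": [{"task_key": "check_volume", "outcome": "true"}],
  "notebook_task": {"notebook_path": "/Repos/etl/process_big_data"}
},
{
  "task_key": "process_small_data",
  "depends_on": [{"task_key": "check_volume", "outcome": "false"}],
  "notebook_task": {"notebook_path": "/Repos/etl/process_small_data"}
}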


📊 9. Real-World Enterprise Use Cases

⭐ Finance​

Multi-step risk scoring → aggregation → validation → reporting.

⭐ Retail​

Daily SKU extraction → price rules → promotions → BI delivery.

⭐ Healthcare​

PHI ingestion → anonymization → validation → controlled-zone storage.

⭐ Logistics​

GPS ingest → cleaning → route clustering → ML scoring → dashboard refresh.

⭐ Manufacturing​

Sensor data → dedupe → QC → anomaly detection.


🧠 Best Practices​

  1. Keep tasks modular (single purpose per task)
  2. Use job clusters for cost control
  3. Add alerts and Slack notifications
  4. Add a validation task before loading curated data
  5. Use task parameters instead of hardcoding values (see the sketch after this list)
  6. Enable run-as service principals for security
  7. Store job configs in repos for version control
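
For practice 5, a minimal sketch: the notebook task passes a parameter through base_parameters (the name run_date is illustrative), and the notebook reads it as a widget instead of hardcoding the path.

# Task definition includes: "base_parameters": {"run_date": "2024-01-15"}
# Inside the notebook, read the parameter instead of hardcoding it:
run_date = dbutils.widgets.get("run_date")

df_raw = spark.read.format("json").load(f"/mnt/raw/api_logs/{run_date}/")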

🎉 Real-World Ending – “The Pipeline is Finally Smart”

Now Arjun’s ETL:

  • understands dependencies
  • retries failures automatically
  • alerts the team instantly
  • uses clean DAG orchestration
  • cuts compute cost with shared job clusters

His manager says:

“This is the pipeline architecture we should have built years ago.”

And everyone finally stops blaming Arjun’s pipelines.


📘 Summary

Databricks Multi-Task Job Workflows provide:

  • ✔ DAG orchestration
  • ✔ Multiple task types
  • ✔ Dependency management
  • ✔ Shared job clusters
  • ✔ Conditional branching
  • ✔ Retry & alerting
  • ✔ Production-grade pipeline automation

A core building block for enterprise-scale data workflows.


👉 Next Topic

Databricks Workflows (New) – Production Orchestration