Spark UI & Job Debugging Techniques — Monitor and Optimize PySpark Jobs
At NeoMart, monitoring PySpark jobs is crucial:
- Some jobs run slower than expected
- Some tasks fail silently or consume too much memory
- Optimizing joins, aggregations, and shuffles requires visibility
The Spark UI is the most powerful tool for analyzing job execution, stages, tasks, and memory usage.
1. Accessing Spark UI
- If running locally: open http://localhost:4040
- On a cluster: access it through the Spark History Server or the Databricks UI
- Tabs to focus on:
- Jobs → high-level view of actions
- Stages → detailed breakdown of tasks
- Storage → cached DataFrames/RDDs
- SQL → executed SQL queries and plans
- Environment → configuration and JVM details
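A minimal sketch of starting a local session so the UI comes up at http://localhost:4040 (the app name is illustrative; if port 4040 is busy, Spark picks the next free port):
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .master("local[*]")        # run locally on all cores
    .appName("neomart-debug")  # illustrative app name
    .getOrCreate())

print(spark.sparkContext.uiWebUrl)  # prints the UI URL for this application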
2. Understanding Jobs and Stages
- Job: Triggered by an action (e.g., count(), show())
- Stage: Set of tasks that can run in parallel without a shuffle
- Task: Execution unit processing a partition
Example:
from pyspark.sql import functions as F

df_filtered = df.filter(F.col("price") > 100)  # transformation: lazy, no job yet
df_filtered.count()  # action: triggers a job
- One action → One job
- Spark UI → see stages, number of tasks, time, and shuffle info
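Because each action triggers its own job, running a second action on the same DataFrame produces a second row in the Jobs tab (a small sketch reusing df_filtered from above):
df_filtered.count()   # job 1 in the Jobs tab
df_filtered.show(5)   # job 2: a separate job for the second action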
3. DAG Visualization
- Spark builds a Directed Acyclic Graph (DAG) of the transformations behind each job
- Narrow vs wide transformations show up differently:
- Narrow: a straight chain of operations → no shuffle
- Wide: a stage boundary where nodes merge → a shuffle happens
Story: NeoMart analysts visualize DAG to spot shuffle-heavy operations slowing jobs.
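A minimal sketch that contrasts the two shapes in the DAG view (it assumes a df with price and category columns; the names are illustrative):
# narrow: filter works partition-by-partition → drawn as a straight chain
narrow_df = df.filter(F.col("price") > 100)

# wide: groupBy needs a shuffle → drawn as a stage boundary in the DAG
wide_df = df.groupBy("category").agg(F.sum("price").alias("total_price"))

wide_df.collect()  # action: open this job in the Jobs tab and expand "DAG Visualization"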
4. Storage Tab — Caching and Persisting
- Shows all cached DataFrames and RDDs
- Displays:
- Storage level (memory/disk)
- Number of cached partitions
- Memory usage
df.cache()  # marks the DataFrame for caching (lazy)
df.count()  # action that materializes the cache
- Check the Storage tab → confirm the DataFrame is cached
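The Storage tab also reflects the storage level you pick. A short sketch using persist() with an explicit level (MEMORY_AND_DISK is one reasonable choice here, not a requirement):
from pyspark import StorageLevel

df.persist(StorageLevel.MEMORY_AND_DISK)  # spill partitions to disk if memory runs short
df.count()                                # action that materializes the persisted data
df.unpersist()                            # later: frees memory and removes the entry from the Storage tab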
5. SQL Tab — Query Monitoring
- For DataFrames registered as temp views or tables, the SQL tab shows:
- Executed queries
- Physical plan
- Execution time
df.createOrReplaceTempView("products")
spark.sql("SELECT AVG(price) FROM products").show()
- UI shows aggregation stage and tasks executed
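To inspect the same physical plan from code, explain() prints what the SQL tab visualizes (the "formatted" mode assumes Spark 3.0+; the query reuses the products view from above):
spark.sql("SELECT AVG(price) FROM products").explain(mode="formatted")  # prints the physical plan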
6. Debugging Common Issues
- Long-running tasks:
- Often due to data skew → consider salting or repartitioning (sketched below)
- High shuffle write/read:
- Use broadcast joins for small tables (sketched below)
- Executor OOM (Out of Memory):
- Persist intermediate results to disk
- Increase executor memory
- Stragglers:
- Skewed keys → repartition or salt
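A minimal sketch of two of the fixes above, a broadcast join and key salting (the orders_df/products_df tables and their columns are illustrative):
from pyspark.sql import functions as F
from pyspark.sql.functions import broadcast

# Broadcast join: ship the small dimension table to every executor
# so the large fact table is not shuffled.
joined = orders_df.join(broadcast(products_df), on="product_id")

# Salting: split a hot key across n_buckets so its rows spread over many tasks,
# then merge the partial aggregates in a second pass.
n_buckets = 8
salted_totals = (
    orders_df
    .withColumn("salt", (F.rand() * n_buckets).cast("int"))
    .groupBy("product_id", "salt")
    .agg(F.sum("amount").alias("partial_total"))
    .groupBy("product_id")
    .agg(F.sum("partial_total").alias("total"))
)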
7. Using Spark History Server
- Tracks completed jobs for offline analysis
- Steps:
- Enable event logging before the SparkContext starts (e.g., in spark-defaults.conf, via spark-submit --conf, or on the session builder):
spark = (SparkSession.builder
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.dir", "/tmp/spark-events")
    .getOrCreate())
- Open the History Server UI → view past jobs, DAGs, stages, and tasks
Story: NeoMart can analyze nightly ETL jobs and spot inefficient transformations even after a job has completed.
8. Tips for Effective Job Debugging
✔ Use the Spark UI DAG to identify wide transformations
✔ Monitor shuffle read/write bytes → optimize joins and aggregations
✔ Cache frequently reused DataFrames
✔ Check task distribution → prevent stragglers
✔ Use the SQL tab for complex query optimization
Summary
Using Spark UI and history server, you can:
- Monitor jobs, stages, and tasks
- Visualize DAGs for performance insight
- Debug skew, shuffle, and memory issues
- Optimize iterative and large-scale pipelines
NeoMart engineers rely on Spark UI to save hours of troubleshooting and make PySpark pipelines production-ready.
Next Topic → Catalyst Optimizer & Tungsten Execution Engine