Complex SQL Queries in PySpark

At NeoMart, simple queries are no longer enough.
Analysts and data engineers need insightful answers from massive datasets:

  • Top 3 products per category
  • Daily active users by region
  • Customers with multiple high-value orders

This is where complex SQL queries in PySpark become indispensable.
Spark SQL supports joins, subqueries, aggregations, and window functions at scale.


Why Complex Queries Matter

  • Combine multiple tables with joins
  • Perform conditional aggregations
  • Use subqueries for filtered or ranked results
  • Apply window functions to calculate running totals, rankings, or moving averages

Without these tools, insights remain incomplete.
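
The SQL examples in this lesson assume two DataFrames, customers and orders, registered as temp views. Below is a minimal setup sketch; the file paths and column names (customer_id, name, order_id, order_date, amount) are illustrative and should be adapted to your data.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("NeoMartComplexSQL").getOrCreate()

# Illustrative paths and schemas -- adjust to match your environment
customers = spark.read.parquet("/data/neomart/customers")  # customer_id, name, ...
orders = spark.read.parquet("/data/neomart/orders")        # order_id, customer_id, order_date, amount

# Register temp views so the SQL below can reference the tables by name
customers.createOrReplaceTempView("customers")
orders.createOrReplaceTempView("orders")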


1. Joins in SQL Queries

SELECT c.customer_id, c.name, SUM(o.amount) AS total_spent
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.name
ORDER BY total_spent DESC

Story Example

NeoMart wants total spending per customer to identify VIPs. SQL allows combining multiple tables in a single query.
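
A minimal sketch of running this query from PySpark, assuming the customers and orders temp views registered in the setup above:

# Run the join-and-aggregate query through Spark SQL
total_spent = spark.sql("""
    SELECT c.customer_id, c.name, SUM(o.amount) AS total_spent
    FROM customers c
    JOIN orders o ON c.customer_id = o.customer_id
    GROUP BY c.customer_id, c.name
    ORDER BY total_spent DESC
""")

total_spent.show(10)  # highest spenders first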


2. Subqueries

SELECT *
FROM orders
WHERE customer_id IN (
    SELECT customer_id
    FROM orders
    GROUP BY customer_id
    HAVING SUM(amount) > 500
)

Use Case

  • Find high-value customers
  • Filter datasets based on aggregated conditions

Subqueries simplify complex filtering logic in a readable way.
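
For comparison, the same filter can be expressed with the DataFrame API. The sketch below, assuming the orders view from the setup, aggregates first and then uses a left-semi join to keep only the matching order rows:

from pyspark.sql import functions as F

orders_df = spark.table("orders")

# Customers whose total spend exceeds 500
high_value_ids = (
    orders_df.groupBy("customer_id")
             .agg(F.sum("amount").alias("total"))
             .filter(F.col("total") > 500)
             .select("customer_id")
)

# Left-semi join keeps order rows for those customers without adding extra columns
high_value_orders = orders_df.join(high_value_ids, on="customer_id", how="left_semi")
high_value_orders.show()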


3. Window Functions in SQL

SELECT customer_id, order_date, amount,
       SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS running_total,
       ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) AS order_rank
FROM orders

Use Case

  • Track cumulative spending per customer
  • Rank latest orders for promotions
  • Analyze trends without reducing row-level data
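
The same running total and recency rank can also be computed with the DataFrame API using window specs; a sketch, assuming the orders view from the setup:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

orders_df = spark.table("orders")

# Window specs: one ordered oldest-first for the running total, one newest-first for ranking
running_window = Window.partitionBy("customer_id").orderBy("order_date")
recency_window = Window.partitionBy("customer_id").orderBy(F.col("order_date").desc())

windowed = orders_df.select(
    "customer_id", "order_date", "amount",
    F.sum("amount").over(running_window).alias("running_total"),
    F.row_number().over(recency_window).alias("order_rank"),
)
windowed.show()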

4. Combining Joins, Subqueries, and Window Functions

SELECT c.customer_id, c.name, o.order_id, o.amount,
       SUM(o.amount) OVER (PARTITION BY c.customer_id ORDER BY o.order_date) AS cumulative_amount
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE o.amount > 50

Story Example

NeoMart wants a detailed view of all customers’ orders above $50, with running totals to reward loyal shoppers.
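
A sketch of running the combined query and keeping the result available for downstream reward logic, assuming the views from the setup:

loyalty_orders = spark.sql("""
    SELECT c.customer_id, c.name, o.order_id, o.amount,
           SUM(o.amount) OVER (
               PARTITION BY c.customer_id ORDER BY o.order_date
           ) AS cumulative_amount
    FROM customers c
    JOIN orders o ON c.customer_id = o.customer_id
    WHERE o.amount > 50
""")

# Register the result as a temp view so later queries can reference it by name
loyalty_orders.createOrReplaceTempView("loyalty_orders")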


5. Tips for Writing Efficient Complex Queries

  • Use temp views instead of repeatedly querying raw tables
  • Avoid selecting unnecessary columns to reduce shuffle
  • Prefer filtering early using WHERE clauses
  • Use broadcast joins for small lookup tables

These practices improve performance and reduce computation time in Databricks.
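
As an example of the last two tips, the sketch below filters early and broadcasts a small lookup table with a SQL hint. The product_categories table and product_id column are hypothetical and stand in for any small dimension table in your environment:

# BROADCAST hint replicates the small lookup table to every executor, avoiding a shuffle join
enriched = spark.sql("""
    SELECT /*+ BROADCAST(pc) */ o.order_id, o.customer_id, pc.category, o.amount
    FROM orders o
    JOIN product_categories pc ON o.product_id = pc.product_id
    WHERE o.amount > 50   -- filter early to cut the data that reaches the join
""")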


Summary

  • Spark SQL supports joins, subqueries, aggregations, and window functions for advanced analytics
  • Complex queries allow combining multiple datasets and performing rich computations
  • Use temp views and optimization techniques for large-scale Spark workflows
  • Mastering complex SQL queries bridges the gap between traditional SQL analysts and big data engineers

Next, we’ll dive into UDFs & UDAFs — Custom Functions in SQL, enabling custom logic and aggregations in Spark SQL.