Dates & Timestamps in PySpark DataFrames
Working with dates and timestamps is essential for analytics, forecasting, time-series modeling, and demand planning.
PySpark provides powerful built-in functions to extract, transform, manipulate, and aggregate date-related fields.
This guide walks you step-by-step through date extraction, aggregations, SQL queries, and derived metrics.
Loading Data
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('dates').getOrCreate()
df = spark.read.csv('/path/to/demand_planning.csv', header=True, inferSchema=True)
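If you want the Date column typed as a proper date up front (rather than relying on schema inference), you can spell out the schema yourself; a minimal sketch, assuming the column names and types shown in the preview below:
from pyspark.sql.types import StructType, StructField, DateType, StringType, IntegerType

# Explicit schema so Date is read as a DateType regardless of inferSchema behavior
schema = StructType([
    StructField('Date', DateType(), True),
    StructField('Product', StringType(), True),
    StructField('ForecastUnits', IntegerType(), True),
    StructField('ActualUnits', IntegerType(), True),
    StructField('LeadTimeDays', IntegerType(), True),
])
df = spark.read.csv('/path/to/demand_planning.csv', header=True, schema=schema)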
Preview & Schema
df.show()
Example output:
+----------+--------+-------------+-----------+------------+
|      Date| Product|ForecastUnits|ActualUnits|LeadTimeDays|
+----------+--------+-------------+-----------+------------+
|2023-01-01|    Soap|          100|         95|           3|
|2023-01-02|    Soap|          120|        110|           2|
|2023-01-03| Shampoo|           80|         85|           4|
|2023-01-04| Shampoo|           90|         88|           3|
|2023-02-01|    Soap|          130|        125|           2|
|2023-02-02| Shampoo|           85|         82|           4|
|2023-03-01|    Soap|          140|        135|           3|
|2023-03-05| Shampoo|          100|        102|           3|
+----------+--------+-------------+-----------+------------+
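If you don't have the CSV handy, the same sample rows can be built directly in code; a minimal sketch (uses to_date to turn the string dates into a DateType column):
from pyspark.sql.functions import to_date

rows = [
    ('2023-01-01', 'Soap', 100, 95, 3),
    ('2023-01-02', 'Soap', 120, 110, 2),
    ('2023-01-03', 'Shampoo', 80, 85, 4),
    ('2023-01-04', 'Shampoo', 90, 88, 3),
    ('2023-02-01', 'Soap', 130, 125, 2),
    ('2023-02-02', 'Shampoo', 85, 82, 4),
    ('2023-03-01', 'Soap', 140, 135, 3),
    ('2023-03-05', 'Shampoo', 100, 102, 3),
]
columns = ['Date', 'Product', 'ForecastUnits', 'ActualUnits', 'LeadTimeDays']
df = spark.createDataFrame(rows, columns)
df = df.withColumn('Date', to_date('Date'))  # string -> DateType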
df.printSchema()
Output:
root
|-- Date: date (nullable = true)
|-- Product: string (nullable = true)
|-- ForecastUnits: integer (nullable = true)
|-- ActualUnits: integer (nullable = true)
|-- LeadTimeDays: integer (nullable = true)
Note: Depending on your Spark version and read options, the Date column may come in as a string. You can convert it with to_date:
from pyspark.sql.functions import to_date
df = df.withColumn("Date", to_date("Date"))
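If your dates use a layout other than yyyy-MM-dd, pass a format pattern as the second argument; for example (the MM/dd/yyyy pattern here is just illustrative):
df = df.withColumn("Date", to_date("Date", "MM/dd/yyyy"))  # e.g. for 01/31/2023-style strings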
Extracting Date Components in PySpark
PySpark provides functions to extract day, month, year, and week numbers.
from pyspark.sql.functions import dayofmonth, month, year, weekofyear
df.select(
    dayofmonth(df['Date']).alias('Day'),
    month(df['Date']).alias('Month'),
    year(df['Date']).alias('Year'),
    weekofyear(df['Date']).alias('WeekOfYear')
).show()
Output:
+---+-----+----+----------+
|Day|Month|Year|WeekOfYear|
+---+-----+----+----------+
|  1|    1|2023|        52|
|  2|    1|2023|         1|
|  3|    1|2023|         1|
|  4|    1|2023|         1|
|  1|    2|2023|         5|
|  2|    2|2023|         5|
|  1|    3|2023|         9|
|  5|    3|2023|         9|
+---+-----+----+----------+
✔️ What This Does
Extracts different date components for time-series analysis, forecasting, and reporting.
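If you want these components kept as columns on the DataFrame rather than just selected, a minimal sketch using withColumn; dayofweek and quarter (also built in) follow the same pattern:
from pyspark.sql.functions import dayofmonth, month, year, weekofyear, dayofweek, quarter

df_dates = (df
    .withColumn('Day', dayofmonth('Date'))
    .withColumn('Month', month('Date'))
    .withColumn('Year', year('Date'))
    .withColumn('WeekOfYear', weekofyear('Date'))
    .withColumn('DayOfWeek', dayofweek('Date'))  # 1 = Sunday ... 7 = Saturday
    .withColumn('Quarter', quarter('Date')))
df_dates.show()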
Aggregation by Year
from pyspark.sql.functions import avg
df_with_year = df.withColumn('Year', year(df['Date']))
df_with_year.groupBy('Year') \
    .mean() \
    .select('Year', 'avg(ActualUnits)') \
    .show()
Output:
+----+----------------+
|Year|avg(ActualUnits)|
+----+----------------+
|2023|          102.75|
+----+----------------+
✔️ Meaning
Computes yearly averages—useful for trend analysis and demand forecasting.
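The same idea extends to monthly granularity, and agg() lets you name the output columns explicitly; a minimal sketch:
from pyspark.sql.functions import year, month, avg, sum as spark_sum

(df
    .withColumn('Year', year('Date'))
    .withColumn('Month', month('Date'))
    .groupBy('Year', 'Month')
    .agg(avg('ActualUnits').alias('AvgActual'),
         spark_sum('ActualUnits').alias('TotalActual'))
    .orderBy('Year', 'Month')
    .show())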
Creating Ratios / Derived Metrics
df.select(
    (df['ForecastUnits'] / df['ActualUnits']).alias('Forecast_to_Actual')
).show()
Output:
+------------------+
|Forecast_to_Actual|
+------------------+
|1.0526315789473684|
|1.0909090909090908|
|0.9411764705882353|
|1.0227272727272727|
|              1.04|
|1.0365853658536585|
| 1.037037037037037|
|0.9803921568627451|
+------------------+
✔️ Meaning
Shows forecasting accuracy by comparing forecast vs actual.
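Building on that ratio, a per-row absolute percentage error and its average give a quick accuracy summary; a minimal sketch (the AbsPctError / MAPE_pct column names are just illustrative):
from pyspark.sql.functions import abs as spark_abs, avg, round

df_err = df.withColumn(
    'AbsPctError',
    spark_abs(df['ForecastUnits'] - df['ActualUnits']) / df['ActualUnits']
)
df_err.select(round(avg('AbsPctError') * 100, 2).alias('MAPE_pct')).show()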
Running SQL Queries with Dates
df.createOrReplaceTempView('demand')
Query: Find Date(s) With Maximum Actual Units
spark.sql("""
SELECT Date, ActualUnits
FROM demand
WHERE ActualUnits = (SELECT MAX(ActualUnits) FROM demand)
""").show()
Output:
+----------+-----------+
|      Date|ActualUnits|
+----------+-----------+
|2023-03-01|        135|
+----------+-----------+
✔️ Meaning
Returns the highest-demand day(s) using SQL subqueries.
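Spark SQL also understands the date functions used earlier, so aggregations by year and month work directly on the temp view; a minimal sketch:
spark.sql("""
    SELECT year(Date)  AS Year,
           month(Date) AS Month,
           AVG(ActualUnits) AS AvgActual
    FROM demand
    GROUP BY year(Date), month(Date)
    ORDER BY Year, Month
""").show()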
More Aggregations & Statistics
from pyspark.sql.functions import round, max, min  # Spark column functions (these shadow Python's built-in round/max/min)
df.select(
    round(avg(df['ActualUnits']), 2).alias('AvgActual'),
    max(df['ForecastUnits']).alias('MaxForecast'),
    min(df['ForecastUnits']).alias('MinForecast')
).show()
Output:
+---------+-----------+-----------+
|AvgActual|MaxForecast|MinForecast|
+---------+-----------+-----------+
|   102.75|        140|         80|
+---------+-----------+-----------+
✔️ Meaning
Computes multiple KPIs in a single query.
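The same KPIs can also be broken out per product by pairing groupBy with agg; a minimal sketch:
from pyspark.sql.functions import round, avg, max as spark_max, min as spark_min

(df.groupBy('Product')
   .agg(round(avg('ActualUnits'), 2).alias('AvgActual'),
        spark_max('ForecastUnits').alias('MaxForecast'),
        spark_min('ForecastUnits').alias('MinForecast'))
   .show())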
🟦 1-Minute Summary — Dates & Time in PySpark
| Code / Expression | What It Does |
|---|---|
| dayofmonth(), month(), year(), weekofyear() | Extract components from a Date column |
| withColumn('Year', year(...)) | Adds a new Year column |
| groupBy('Year').mean() | Aggregates metrics by year |
| (ForecastUnits / ActualUnits) | Computes forecast accuracy (forecast-to-actual ratio) |
| createOrReplaceTempView() | Registers the DataFrame for SQL |
| SELECT ... WHERE ActualUnits = (SELECT MAX(...)) | Finds rows with maximum demand |
| round(avg(), 2), max(), min() | Summary statistics |
Next, we’ll cover Spark SQL basics: registering temp views and running queries.