First PySpark Job — Hello World Example
Now that we understand PySpark setup, SparkSession, and SparkContext, it’s time to write your first PySpark job. Think of this as the Hello World of distributed computing, but instead of printing text, we’ll process some data in parallel!
Step 1: Initialize SparkSession
Every PySpark job starts with a SparkSession:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
.appName("HelloWorldJob") \
.master("local[*]") \
.getOrCreate()
- appName: Name of your Spark job
- master: Execution mode (local[*] uses all CPU cores)
SparkSession automatically creates a SparkContext under the hood.
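If you ever need the lower-level RDD API, that SparkContext is exposed directly on the session. A quick sanity check (a minimal sketch; the printed values simply echo the builder settings above):
sc = spark.sparkContext
print(sc.appName)   # HelloWorldJob
print(sc.master)    # local[*]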
Step 2: Create Sample Data
We will create a small dataset to simulate a real-world scenario:
data = [("Alice", 25), ("Bob", 30), ("Charlie", 28)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
Here, we are using a DataFrame, which is Spark’s high-level abstraction for structured data.
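If you want explicit control over column types instead of letting Spark infer them from the Python tuples, you can pass a schema to createDataFrame. A minimal sketch (the nullable flags here are an assumption for this sample data):
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
])
df = spark.createDataFrame(data, schema)
df.printSchema()  # Name: string, Age: integer (both nullable)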
Step 3: Perform a Simple Transformation
Let’s keep only people aged 30 or older (filtering out anyone younger):
filtered_df = df.filter(df.Age >= 30)
This is the equivalent of a SQL WHERE clause in Spark.
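In fact, the same filter can be written as literal SQL by registering the DataFrame as a temporary view. A sketch (the view name people is an arbitrary choice):
df.createOrReplaceTempView("people")
filtered_df = spark.sql("SELECT Name, Age FROM people WHERE Age >= 30")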
Step 4: Show Results
filtered_df.show()
Expected output:
+----+---+
|Name|Age|
+----+---+
| Bob| 30|
+----+---+
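show() prints to the console, which is ideal for local verification. If you need the results back in your Python program, the collect() and count() actions are the usual alternatives (a small sketch):
rows = filtered_df.collect()    # [Row(Name='Bob', Age=30)]
print(filtered_df.count())      # 1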
Congratulations! You just ran your first PySpark job and applied a basic transformation.
Step 5: Stop SparkSession
Always stop the Spark session to free resources:
spark.stop()
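In a standalone script, it is common to wrap the job in try/finally so the session is stopped even when a transformation raises an error. One possible pattern (a sketch, not the only way to structure it):
spark = SparkSession.builder.appName("HelloWorldJob").master("local[*]").getOrCreate()
try:
    df = spark.createDataFrame(data, columns)
    df.filter(df.Age >= 30).show()
finally:
    spark.stop()  # runs even if the job above fails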
Real-Life Example
At ShopVerse Retail, new ETL jobs often start like this:
- Load sample CSV files into a DataFrame
- Apply basic transformations (filter, groupBy, aggregate)
- Verify results locally using show()
- Deploy the same logic to a cluster for production scale
This approach allows safe experimentation without affecting production pipelines.
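A skeleton of such a starter job might look like the following. This is a hypothetical sketch: the file sales.csv and the columns amount and category are assumptions for illustration, not part of the example above:
from pyspark.sql import functions as F

# Load a sample CSV file into a DataFrame
sales = spark.read.csv("sales.csv", header=True, inferSchema=True)

# Apply basic transformations: filter, groupBy, aggregate
summary = (sales
    .filter(F.col("amount") > 0)
    .groupBy("category")
    .agg(F.sum("amount").alias("total_amount")))

# Verify results locally before deploying the same logic to a cluster
summary.show()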
Key Takeaways
- Every PySpark job starts with a SparkSession.
- Use DataFrames for structured data processing.
- Apply transformations (filter, map, select) to process data.
- Always stop SparkSession to release cluster resources.
- Start small with sample data before scaling to large datasets.
Next, we will explore RDDs vs DataFrames vs Datasets in detail to understand which abstraction to choose for your Spark jobs.