Sorting, Sampling & Limit in PySpark DataFrames — Efficient Data Retrieval at Scale

At NeoMart, data analysts often ask questions like:

  • “Give me the top 10 highest-selling products.”
  • “Fetch a random 1% sample for model testing.”
  • “Sort all orders by timestamp for fraud analysis.”

Though simple, operations like sort, sample, and limit can get expensive at scale.
This chapter shows how to use them efficiently on large distributed datasets.


1. Example DataFrame

from pyspark.sql import functions as F

df = spark.createDataFrame([
    (1, "Laptop", 1200),
    (2, "Mouse", 25),
    (3, "Monitor", 220),
    (4, "Keyboard", 75),
    (5, "Headset", 60)
], ["product_id", "name", "price"])

df.show()

Input Data

+----------+--------+-----+
|product_id|    name|price|
+----------+--------+-----+
|         1|  Laptop| 1200|
|         2|   Mouse|   25|
|         3| Monitor|  220|
|         4|Keyboard|   75|
|         5| Headset|   60|
+----------+--------+-----+

2. Sorting DataFrames

Sorting is done using:

  • orderBy()
  • sort() (an alias of orderBy())

df_sorted = df.orderBy(F.desc("price"))
df_sorted.show()

Output

+----------+--------+-----+
|product_id|    name|price|
+----------+--------+-----+
|         1|  Laptop| 1200|
|         3| Monitor|  220|
|         4|Keyboard|   75|
|         5| Headset|   60|
|         2|   Mouse|   25|
+----------+--------+-----+
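
Both entry points accept the same arguments, so the following forms are equivalent (a minimal sketch reusing the df defined above):

# All three produce the same descending-price ordering
df.orderBy(F.desc("price"))
df.sort(F.col("price").desc())
df.sort("price", ascending=False)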

Story

NeoMart’s fraud-detection team sorts orders by price to catch unusually high transactions.


3. Multi-Column Sorting

df.orderBy(F.desc("price"), F.asc("name")).show()

When values tie on the first key, the next key breaks the tie; here, products with equal prices are ordered alphabetically by name.
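
The same multi-column sort can also be written with plain column names and per-column ascending flags, which is handy when the sort specification is built dynamically:

# Equivalent form: one ascending flag per sort column
df.orderBy(["price", "name"], ascending=[False, True]).show()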


4. Sorting Large Datasets — Performance Tips

✔ Sorting triggers a global shuffle
✔ Use partitioning to reduce sort costs
✔ Use sortWithinPartitions() when global order isn’t required

Example

df_small_sort = df.sortWithinPartitions("price")

Useful for window functions or per-partition ordering.
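
A common pattern is to co-locate rows by a key first and then sort only inside each partition; a minimal sketch reusing the df above (the partition count of 4 is an illustrative choice):

# Shuffle once to hash-partition by name, then sort each partition locally.
# This avoids the global range sort that orderBy() would trigger.
df_local = df.repartition(4, "name").sortWithinPartitions(F.desc("price"))
df_local.show()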


5. Limit — Fetching Top N Rows

limit(n) returns a DataFrame containing at most n rows.

df.limit(3).show()

Output

+----------+-------+-----+
|product_id|   name|price|
+----------+-------+-----+
|         1| Laptop| 1200|
|         2|  Mouse|   25|
|         3|Monitor|  220|
+----------+-------+-----+

Important: limit() does not guarantee which rows are returned or their order. Combine it with orderBy() for deterministic results.
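
Note that limit() is a transformation that returns a new DataFrame, so it composes with further operations, while show(n) merely prints a preview:

# limit() yields a DataFrame that can be transformed further
preview = df.limit(3).filter(F.col("price") < 100)
preview.show()

# show(n) just prints the first n rows without creating a new DataFrame
df.show(3)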


6. Top-N Query (Sort + Limit)

df.orderBy(F.desc("price")).limit(2).show()

Output

+----------+-------+-----+
|product_id|   name|price|
+----------+-------+-----+
|         1| Laptop| 1200|
|         3|Monitor|  220|
+----------+-------+-----+
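
Recent Spark versions typically fuse this sort + limit pair into a single top-N physical operator (TakeOrderedAndProject), so the full dataset is never globally sorted; you can verify this on your cluster with explain():

# Inspect the physical plan to confirm the sort and limit are fused
df.orderBy(F.desc("price")).limit(2).explain()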

7. Sampling — Random Subset of Data

Sampling helps NeoMart’s ML team train models faster.

7.1 Random Sampling

df_sample = df.sample(withReplacement=False, fraction=0.4, seed=42)
df_sample.show()

Possible Output

+----------+-------+-----+
|product_id|   name|price|
+----------+-------+-----+
|         3|Monitor|  220|
|         5|Headset|   60|
+----------+-------+-----+
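
Keep in mind that fraction is a per-row probability, not an exact row count: with a fixed seed the same rows come back every run, but the sample size is only approximately fraction * total.

# Roughly fraction * count rows; the exact size depends on the seed and data
print(df.sample(withReplacement=False, fraction=0.4, seed=42).count())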

7.2 Sampling With Replacement

df_replace = df.sample(withReplacement=True, fraction=0.5)

Useful in bootstrapping algorithms.
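
As a minimal bootstrapping sketch (the three resamples and the mean-price statistic are illustrative assumptions, not part of the original example):

# Draw three bootstrap resamples and compute the mean price of each
for i in range(3):
    resample = df.sample(withReplacement=True, fraction=1.0, seed=i)
    resample.agg(F.avg("price").alias("mean_price")).show()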


7.3 Stratified Sampling (sampleBy)

Use it when each group (stratum) needs its own sampling fraction.

df_category = spark.createDataFrame([
    ("Electronics", 1),
    ("Electronics", 2),
    ("Fashion", 3),
    ("Fashion", 4)
], ["category", "id"])

fractions = {"Electronics": 0.5, "Fashion": 1.0}

df_stratified = df_category.sampleBy("category", fractions)
df_stratified.show()

Output (varies)

+-----------+--+
|   category|id|
+-----------+--+
|Electronics| 2|
|    Fashion| 3|
|    Fashion| 4|
+-----------+--+
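
sampleBy() also accepts a seed, and any category missing from the fractions dict is treated as fraction 0.0; a reproducible variant:

# Fix the seed so the stratified sample is reproducible across runs
df_stratified = df_category.sampleBy("category", fractions, seed=42)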

8. Limit vs Sample — When to Use What?

Requirement                     Use
Get fixed top N rows            limit()
Need randomness                 sample()
Need proportional sampling      sampleBy()
Top N by a specific metric      orderBy().limit()

9. Using Sorting, Sampling & Limit in SQL

Sorting

SELECT * FROM products ORDER BY price DESC;

Sampling (Random rows)

SELECT * FROM products TABLESAMPLE(50 PERCENT);

Limit

SELECT * FROM products LIMIT 5;
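
These statements assume a products table is visible to SQL; a minimal sketch that first registers the DataFrame from section 1 as a temporary view:

# Expose the DataFrame to SQL, then run any of the queries above
df.createOrReplaceTempView("products")
spark.sql("SELECT * FROM products ORDER BY price DESC LIMIT 2").show()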

10. Best Practices

✔ Avoid global sorting on multi-terabyte data
✔ Use sortWithinPartitions() when global order is not required
✔ Keep the seed fixed for reproducible sampling
✔ Always combine orderBy() + limit() for stable top-N queries
✔ Avoid sampling huge datasets repeatedly → cache first (see the sketch below)
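
For the last point, a minimal caching sketch (the fractions and seeds are illustrative):

# Cache once, then draw several samples without recomputing the source
df.cache()
sample_a = df.sample(withReplacement=False, fraction=0.3, seed=1)
sample_b = df.sample(withReplacement=False, fraction=0.3, seed=2)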


Summary

You now know how to:

  • Sort DataFrames efficiently
  • Retrieve random or stratified samples
  • Limit output for fast previews
  • Optimize sorting and sampling operations
  • Balance correctness vs performance

Sorting, sampling, and limiting are essential for analytics, ML pipelines, data previewing, and debugging at NeoMart — and now you can use them expertly too.


Next Topic → Partitioning & Bucketing for Large Datasets