
Working with Large Files, Compression Types & Optimization Tips

Welcome back to RetailCo, our fictional retail company.
Alice, the data engineer, now faces a new challenge: loading massive historical sales and clickstream data efficiently.

“If we don’t handle large files and compression properly, loads will be slow, costly, and error-prone,” she explains.

Let’s explore how to work with large files, choose compression types, and optimize Snowflake performance.


🏗️ Challenge of Large Files

  • Very large files limit parallelism and slow down ETL
  • Higher risk of time-outs or memory pressure during loading
  • Higher storage and compute costs

RetailCo example: 500 GB of historical sales CSVs from vendors need to be loaded quickly for analytics.


🔹 1️⃣ Best Practices for Large Files

  1. Split huge files into manageable chunks (~100 MB to 1 GB each)
  2. Use external stages (S3, Azure, GCS) to avoid internal stage limits
  3. Leverage Snowflake parallelism with multiple files
  4. Avoid too many tiny files (under ~10 MB), which add per-file load overhead

Example:

  • Split the 500 GB of CSV data into ~500 files of ~1 GB each
  • Load them in parallel using COPY INTO or Snowpipe (sketched below)
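A minimal sketch of this parallel load, assuming the split files follow a hypothetical naming pattern (sales_part_*.csv.gz) and reusing the @S3_SALES_STAGE external stage shown later in this lesson:

-- One COPY INTO statement picks up all matching files; Snowflake parallelizes across files
COPY INTO SALES
  FROM @S3_SALES_STAGE
  PATTERN = '.*sales_part_.*\.csv\.gz'
  FILE_FORMAT = (TYPE = CSV COMPRESSION = GZIP);

-- Alternatively, a Snowpipe (pipe name is illustrative) keeps ingesting new files as they arrive
CREATE OR REPLACE PIPE SALES_PIPE AUTO_INGEST = TRUE AS
  COPY INTO SALES
  FROM @S3_SALES_STAGE
  FILE_FORMAT = (TYPE = CSV COMPRESSION = GZIP);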

🔹 2️⃣ Compression Types

Snowflake supports automatic decompression for GZIP, BZIP2, ZSTD, and other common formats (the default COMPRESSION = AUTO detects the codec for you):

Compression | Use Case | Pros | Cons
GZIP | CSV, JSON | Widely supported, reduces size 5–10x | Slower decompression
BZIP2 | CSV, JSON | High compression ratio | Slow to compress and decompress
ZSTD | CSV, JSON | Very fast with a strong compression ratio | Less widely supported by older tools than GZIP
NONE | Already-compressed files (e.g., Parquet) | No decompression overhead | No size reduction for plain files

RetailCo example: Alice compresses large CSVs with GZIP to reduce storage and speed up loads.

COPY INTO SALES
  FROM @S3_SALES_STAGE
  FILE_FORMAT = (TYPE = CSV COMPRESSION = GZIP);
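Because COMPRESSION defaults to AUTO, Snowflake would also detect the gzipped files on its own; setting GZIP explicitly simply documents the expected input.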

🔹 3️⃣ File Format Optimization

  • Parquet for large datasets → smaller, columnar, faster queries
  • CSV for simple ingestion, but compress it (GZIP)
  • JSON for nested data → use VARIANT column, compress with GZIP

Rule of thumb: Use columnar formats (Parquet/ORC) for analytics, row-based (CSV/JSON) for raw ingest.
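A rough sketch of these choices, assuming a hypothetical CLICKSTREAM_RAW table and illustrative stage names (@S3_CLICKSTREAM_STAGE, @S3_SALES_PARQUET_STAGE):

-- Nested JSON lands in a single VARIANT column and stays queryable with dot notation
CREATE OR REPLACE TABLE CLICKSTREAM_RAW (EVENT VARIANT);

COPY INTO CLICKSTREAM_RAW
  FROM @S3_CLICKSTREAM_STAGE
  FILE_FORMAT = (TYPE = JSON COMPRESSION = GZIP);

-- Parquet is columnar and already compressed; columns map to the table by name
COPY INTO SALES
  FROM @S3_SALES_PARQUET_STAGE
  FILE_FORMAT = (TYPE = PARQUET)
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;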


🔹 4️⃣ Snowflake Load Optimization Tips

  1. Use multiple files to leverage parallel loading
  2. Clustered tables → improved query performance on large datasets
  3. Avoid auto-compressing already compressed files
  4. Use staged files efficiently (internal/external stages)
  5. Monitor load performance via COPY_HISTORY or LOAD_HISTORY (see the query sketch after this list)
  6. Purge old staged files to save storage
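A hedged example of the monitoring step, using the INFORMATION_SCHEMA.COPY_HISTORY table function (the 24-hour window is illustrative):

-- Files loaded into SALES over the last 24 hours, with row counts and errors
SELECT FILE_NAME, STATUS, ROW_COUNT, ERROR_COUNT, LAST_LOAD_TIME
FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
  TABLE_NAME => 'SALES',
  START_TIME => DATEADD(HOUR, -24, CURRENT_TIMESTAMP())
));

For tip 6, COPY INTO also accepts PURGE = TRUE, which deletes staged files automatically after a successful load.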

🧩 RetailCo Real-World Scenario

  1. Alice splits 500 GB CSVs into 500 files (~1 GB each)
  2. Compresses them with GZIP
  3. Stages them in S3 external stage
  4. Loads in parallel using COPY INTO
  5. Uses clustered table for faster aggregation queries

Outcome: ETL runs efficiently, cost is optimized, and dashboards are updated faster.


🧠 Quick Tips Checklist

  • Split large files → ~100 MB–1 GB
  • Compress files (GZIP/ZSTD) → reduces storage & network usage
  • Use Parquet for analytics-heavy tables
  • Leverage Snowflake parallelism by loading multiple files
  • Monitor load history and optimize warehouses for heavy loads

🏁 Quick Summary

  • Large files require splitting, compression, and staging for efficient Snowflake loads
  • Compression types: GZIP, BZIP2, ZSTD, NONE
  • File format: Parquet for analytics, CSV/JSON for raw ingestion
  • Use parallel loading, clustered tables, and staged files
  • Benefits: faster ETL, lower cost, optimized storage, improved query performance

🚀 Coming Next

👉 Snowflake Data Types Explained with Use Cases