File Operators: S3, GCS, and Local Filesystem
Every data pipeline has a moment where data becomes a file.
CSV exports.
JSON payloads.
Parquet partitions.
Log archives.
Airflow doesn't process the data itself;
it coordinates how files move, arrive, and get transformed.
That coordination is handled by File Operators.
What Are File Operators in Airflow?
File Operators manage:
- File transfers
- File existence checks
- Uploads and downloads
- Movement between local and cloud storage
They are built on top of storage hooks, giving you:
- Authentication via Airflow Connections
- Retry and logging support
- Consistent patterns across providers
When Should You Use File Operators?
Ideal Use Cases
- Uploading data to S3 or GCS
- Downloading files for processing
- Moving files between buckets
- Archiving or cleaning up files
- Validating file availability
When Not to Use Them
- Row-level data transformations
- Streaming workloads
- Heavy compute logic
Local Filesystem Operators
Let's start with the simplest form: local files.
FileSensor (Local)
from airflow.sensors.filesystem import FileSensor
FileSensor(
    task_id="wait_for_local_file",
    filepath="/data/input/sales_2024-01-10.csv",
    poke_interval=60,  # seconds between checks
    timeout=3600,  # give up after one hour
)
Input
| Parameter | Value |
|---|---|
| filepath | /data/input/sales_2024-01-10.csv |
Output
File detected successfully
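To make `poke_interval` and `timeout` concrete, here is a minimal sketch of the polling loop a file sensor performs. It is a plain-Python illustration, not Airflow's actual implementation, and the function name `wait_for_file` is hypothetical.

```python
import os
import time


def wait_for_file(filepath: str, poke_interval: float = 60, timeout: float = 3600) -> bool:
    """Poll until `filepath` exists; return False if `timeout` seconds pass first."""
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(filepath):
            return True
        if time.monotonic() >= deadline:
            return False
        # Sleep between checks so the loop does not spin on the filesystem.
        time.sleep(poke_interval)
```

Unlike this sketch, a real sensor fails the task when the timeout is reached instead of returning `False`.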
Amazon S3 Operators
S3 is one of the most common storage layers in modern pipelines.
Uploading Files to S3
from airflow.providers.amazon.aws.transfers.local_to_s3 import LocalFilesystemToS3Operator
LocalFilesystemToS3Operator(
    task_id="upload_to_s3",
    filename="/data/output/sales.csv",  # local source file
    dest_key="sales/2024/01/sales.csv",  # object key inside the bucket
    dest_bucket="analytics-bucket",
    aws_conn_id="aws_default",
)
Input
| Source | Destination |
|---|---|
| /data/output/sales.csv | s3://analytics-bucket/sales/2024/01/sales.csv |
Output
Upload completed successfully
Downloading Files from S3
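from airflow.decorators import task
from airflow.providers.amazon.aws.hooks.s3 import S3Hook
@task(task_id="download_from_s3")
def download_from_s3():
    # The Amazon provider does not ship an S3-to-local transfer operator,
    # so this task calls S3Hook directly.
    s3_object = S3Hook(aws_conn_id="aws_default").get_key(
        key="events/events_2024-01-10.json",
        bucket_name="raw-data",
    )
    # get_key returns a boto3 S3 Object; download_file writes it to a local path.
    s3_object.download_file("/tmp/events.json")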
Input
| Source | Destination |
|---|---|
| s3://raw-data/events/events_2024-01-10.json | /tmp/events.json |
Output
File downloaded successfully
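Google Cloud Storage (GCS) Operators
GCS operators follow the same patterns as their S3 counterparts; mainly the parameter names differ (`src` and `dst` instead of `filename` and `dest_key`).
Uploading Files to GCS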
from airflow.providers.google.cloud.transfers.local_to_gcs import LocalFilesystemToGCSOperator
LocalFilesystemToGCSOperator(
    task_id="upload_to_gcs",
    src="/data/output/customers.csv",  # local source file
    dst="customers/2024/customers.csv",  # object path inside the bucket
    bucket="analytics-gcs-bucket",
    gcp_conn_id="google_cloud_default",
)
Input
| Source | Destination |
|---|---|
| /data/output/customers.csv | gs://analytics-gcs-bucket/customers/2024/customers.csv |
Output
File uploaded to GCS
Downloading Files from GCS
from airflow.providers.google.cloud.transfers.gcs_to_local import GCSToLocalFilesystemOperator
GCSToLocalFilesystemOperator(
    task_id="download_from_gcs",
    bucket="raw-events",
    object_name="2024/01/events.json",  # object path inside the bucket
    filename="/tmp/events.json",  # local destination
)
Templating Paths with Execution Date
File operators render Jinja templates in their templated fields, which typically include the path and key arguments:
dest_key="sales/{{ ds }}/sales.csv"
This enables:
- Partitioned storage
- Date-based organization
- Backfill-friendly pipelines
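In Airflow, `{{ ds }}` renders to the run's logical date as `YYYY-MM-DD`. The sketch below reproduces that substitution in plain Python so the resulting key is easy to see. `render_dest_key` is a hypothetical helper; in a real DAG the Jinja engine performs this step when the task runs.

```python
from datetime import date


def render_dest_key(template: str, logical_date: date) -> str:
    """Replace the `{{ ds }}` placeholder with an ISO date, as the `ds` macro would."""
    return template.replace("{{ ds }}", logical_date.isoformat())


key = render_dest_key("sales/{{ ds }}/sales.csv", date(2024, 1, 10))
print(key)  # sales/2024-01-10/sales.csv
```

Because the key depends only on the logical date, re-running or backfilling a given day writes to the same partition each time.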
File Operators & XCom
Most file operators:
- Do not push XComs
- Rely on task success/failure
This is intentional: the files themselves are the contract between tasks.
File Operators vs Sensors
| Use Case | Operator | Sensor |
|---|---|---|
| Move file | Yes | No |
| Check file exists | No | Yes |
| Wait for arrival | No | Yes |
Often used together:
- Sensor waits
- Operator moves or processes
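The sensor-then-operator pattern can be sketched as a small DAG. This assumes Airflow 2.4 or later with the Amazon provider installed; the DAG id, paths, and bucket name are illustrative, and `build_file_pipeline` is a hypothetical wrapper rather than an Airflow convention.

```python
from datetime import datetime


def build_file_pipeline():
    """Return a DAG in which a file sensor gates an S3 upload."""
    # Imports live inside the function so this module loads without Airflow;
    # in a DAG file they would normally sit at the top.
    from airflow import DAG
    from airflow.providers.amazon.aws.transfers.local_to_s3 import (
        LocalFilesystemToS3Operator,
    )
    from airflow.sensors.filesystem import FileSensor

    with DAG(
        dag_id="sales_file_pipeline",  # hypothetical DAG name
        start_date=datetime(2024, 1, 1),
        schedule=None,  # assumes Airflow 2.4+; older releases use schedule_interval
        catchup=False,
    ) as dag:
        wait = FileSensor(
            task_id="wait_for_local_file",
            filepath="/data/input/sales_{{ ds }}.csv",
            poke_interval=60,
            timeout=3600,
        )
        upload = LocalFilesystemToS3Operator(
            task_id="upload_to_s3",
            filename="/data/input/sales_{{ ds }}.csv",
            dest_key="sales/{{ ds }}/sales.csv",
            dest_bucket="analytics-bucket",
            aws_conn_id="aws_default",
        )
        # The upload runs only after the sensor has seen the file.
        wait >> upload
    return dag
```

In a DAG file you would call `dag = build_file_pipeline()` at module level so the scheduler can discover it.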
Security Best Practices
Recommended
- Use IAM roles or service accounts
- Avoid embedding access keys
- Limit bucket permissions
- Encrypt sensitive files
Avoid
- Hardcoded credentials
- Overly broad bucket access
- Public buckets for internal data
Common Mistakes
- Mixing transformation logic with file movement
- Ignoring idempotency
- Hardcoding file paths
- Uploading partially written files
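One way to avoid uploading a partially written file is to write it under a temporary name and rename it once it is complete, so a sensor or upload task never sees a half-written file. The sketch below uses only the standard library. `write_atomically` is a hypothetical helper, and it assumes the temporary file and the target live on the same filesystem.

```python
import os
import tempfile


def write_atomically(path: str, data: bytes) -> None:
    """Write `data` to a temp file in the target directory, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the bytes reach the disk
        # os.replace is atomic when source and target share a filesystem.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

A sensor watching for the final file name only matches after the rename, at which point the content is complete.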
Real-World Use Casesβ
- Data lake ingestion
- ML feature storage
- Report generation
- Backup and archival workflows
- Cross-cloud data movement
Summary
File Operators are the logistics layer of Airflow.
Key Takeaways:
- Move files reliably across systems
- Consistent patterns across S3, GCS, and local
- Deep integration with Airflow Connections
- Best used with sensors for event-driven workflows
They keep your pipelines organized, scalable, and cloud-native.
What's Next?
Next article in the series:
HttpOperator & REST API Workflows