
Handling Semi-Structured Data (JSON, XML, Avro)

✨ Story Time — “How Do I Query JSON in Snowflake?”

Arjun, a data engineer, just received a dataset from a new SaaS platform.

Problem:

  • The data arrives as JSON files
  • It contains nested arrays, mixed types, and dynamic fields
  • Some of it is exported as XML for another system
  • Some files are in Avro format

He’s used to traditional relational tables and asks:

“Do I need to flatten everything into columns? Do I need to preprocess outside Snowflake?”

The answer to both questions: no. Snowflake can load and query semi-structured data directly.


1️⃣ Understanding the Snowflake VARIANT Type

Snowflake stores semi-structured data in VARIANT columns. Data in the following formats can be loaded into a VARIANT:

  • JSON
  • XML
  • Avro
  • Parquet
  • ORC

Example:

CREATE TABLE raw_data (
  id STRING,
  data VARIANT
);

Input: A CREATE TABLE statement declaring two columns, id (STRING) and data (VARIANT).

Output: An empty table raw_data is created.
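
To try the table without any staged files, a single document can be inserted by hand. This is a minimal sketch that assumes the raw_data table above; the id value 'demo-1' and the sample document are made up for illustration. PARSE_JSON turns a JSON string into a VARIANT, and TYPEOF reports what kind of value a VARIANT holds.

```sql
-- Hypothetical sample row. PARSE_JSON is not accepted inside a VALUES clause,
-- so the row is inserted with INSERT ... SELECT instead.
INSERT INTO raw_data (id, data)
SELECT 'demo-1', PARSE_JSON('{"customer": {"name": "John", "age": 35}}');

-- Inspect the types stored inside the VARIANT.
SELECT
  id,
  TYPEOF(data)              AS document_type,
  TYPEOF(data:customer.age) AS age_type
FROM raw_data
WHERE id = 'demo-1';
```

The second query is a quick way to check whether a field was stored as a number, a string, an object, or an array before writing casts against it.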


2️⃣ Loading Semi-Structured Data

JSON Example:

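COPY INTO raw_data (data)
FROM (
  SELECT $1
  FROM @my_stage/json_files/
)
FILE_FORMAT = (TYPE = 'JSON');

Input: JSON files staged in @my_stage/json_files/.

Output: Each JSON document becomes one row in raw_data, with the parsed document stored in the VARIANT column data. A JSON file yields a single VARIANT value ($1), and raw_data has two columns, so the statement names the target column explicitly; id is left NULL here.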
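
Before loading, it can help to look at what the stage actually contains. The sketch below assumes the same stage path as above; the file format name my_json_format is hypothetical. STRIP_OUTER_ARRAY is only useful when each file wraps its documents in one top-level array.

```sql
-- my_json_format is a hypothetical name for a reusable file format.
CREATE OR REPLACE FILE FORMAT my_json_format
  TYPE = 'JSON'
  STRIP_OUTER_ARRAY = TRUE;  -- treat each element of a top-level [...] array as its own row

-- Preview a few staged documents; $1 is the parsed document as a VARIANT.
SELECT $1
FROM @my_stage/json_files/ (FILE_FORMAT => 'my_json_format')
LIMIT 10;
```

The same named format can then be referenced in COPY INTO with FILE_FORMAT = (FORMAT_NAME = 'my_json_format').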


XML Example:

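COPY INTO raw_data (data)
FROM (
  SELECT $1
  FROM @my_stage/xml_files/
)
FILE_FORMAT = (TYPE = 'XML');

Input: XML files staged in @my_stage/xml_files/.

Output: Each XML document is parsed and stored in the VARIANT column data. The SELECT $1 form targets the data column only, so id is left NULL.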


Avro Example:

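COPY INTO raw_data (data)
FROM (
  SELECT $1
  FROM @my_stage/avro_files/
)
FILE_FORMAT = (TYPE = 'AVRO');

Input: Avro files staged in @my_stage/avro_files/.

Output: Each Avro record is loaded into data as a VARIANT. Snowflake reads the schema embedded in the Avro files, so no schema has to be declared. The SELECT $1 form targets the data column only, so id is left NULL.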
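
To see which fields Snowflake detects in the staged Avro files, the INFER_SCHEMA table function can be queried before or after loading. This is a sketch under the assumption that the files sit at the stage path above; the file format name my_avro_format is hypothetical.

```sql
-- my_avro_format is a hypothetical name; INFER_SCHEMA needs a named file format.
CREATE OR REPLACE FILE FORMAT my_avro_format
  TYPE = 'AVRO';

-- Returns one row per detected column, including its name, type, and nullability.
SELECT *
FROM TABLE(
  INFER_SCHEMA(
    LOCATION    => '@my_stage/avro_files/',
    FILE_FORMAT => 'my_avro_format'
  )
);
```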


3️⃣ Querying JSON / XML / Avro

JSON Example:

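SELECT
  data:customer.name::STRING AS customer_name,
  data:customer.address.city::STRING AS city
FROM raw_data
WHERE data:customer.age::NUMBER > 30;

Input (sample JSON):

{
  "customer": {
    "name": "John",
    "age": 35,
    "address": { "city": "Bangalore" }
  }
}

Output:

customer_name | city
John          | Bangalore

The ::STRING casts matter here: without them the columns are returned as VARIANT values, and strings are displayed with their JSON double quotes ("John").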
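
A few path-access details are easy to trip over, so here is a small sketch against the same sample document. The phone key is hypothetical and deliberately absent from the sample.

```sql
SELECT
  data['customer']['name']::STRING AS customer_name,  -- bracket notation, same field as data:customer.name
  data:customer.phone::STRING      AS phone,          -- hypothetical key; a missing key returns NULL, not an error
  data:Customer.name::STRING       AS wrong_case      -- NULL for the sample: keys in a path are case-sensitive
FROM raw_data;
```

The column name itself (data) follows normal SQL identifier rules, but everything after the colon must match the JSON keys exactly.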

XML Example:

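SELECT
  XMLGET(data, 'name'):"$"::STRING AS customer_name,
  XMLGET(XMLGET(data, 'address'), 'city'):"$"::STRING AS city
FROM raw_data;

XML in a VARIANT is not addressed by element-name paths the way JSON is. Each element is stored as an object whose content sits under the "$" key, and XMLGET returns a child element by tag name. Here data holds the root <customer> element.

Input (sample XML):

<customer>
  <name>Arjun</name>
  <address><city>Hyderabad</city></address>
</customer>

Output:

customer_name | city
Arjun         | Hyderabad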
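
To see how Snowflake represents a parsed XML element, the sample can be built inline with PARSE_XML. This is a sketch only: the id attribute is added for illustration and is not part of the sample above. The tag name is stored under the "@" key, the content under "$", and an attribute under "@" followed by the attribute name.

```sql
WITH sample AS (
  SELECT PARSE_XML(
    '<customer id="c1"><name>Arjun</name><address><city>Hyderabad</city></address></customer>'
  ) AS data
)
SELECT
  data:"@"::STRING                       AS root_tag,       -- tag name of the root element
  data:"@id"::STRING                     AS customer_id,    -- hypothetical id attribute
  XMLGET(data, 'name'):"$"::STRING       AS customer_name   -- content of the <name> child
FROM sample;
```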

Avro Example:

SELECT
  data:key1::STRING AS key1_value,
  data:key2::NUMBER AS key2_value
FROM raw_data;

Input: An Avro record with fields key1 and key2.

Output:

key1_value | key2_value
A123       | 90

4️⃣ Flattening Nested Arrays

Example JSON:

{
  "customer": "John",
  "orders": [
    { "id": 1, "amount": 100 },
    { "id": 2, "amount": 150 }
  ]
}

Query:

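SELECT
  data:customer::STRING AS customer,
  f.value:id::NUMBER AS order_id,
  f.value:amount::NUMBER AS order_amount
FROM raw_data,
LATERAL FLATTEN(input => data:orders) f;

Input: An orders array with two objects. FLATTEN returns one row per array element (exposed as f.value), and LATERAL lets it read data from the current row of raw_data.

Output:

customer | order_id | order_amount
John     | 1        | 100
John     | 2        | 150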
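
FLATTEN returns more than the element value. The sketch below, against the same sample document, also selects the element's position in the array and uses OUTER => TRUE so that customers with a missing or empty orders array still appear, with NULL order fields.

```sql
SELECT
  data:customer::STRING  AS customer,
  f.index                AS order_position,  -- 0-based position of the element in the array
  f.value:id::NUMBER     AS order_id
FROM raw_data,
LATERAL FLATTEN(input => data:orders, outer => TRUE) f;
```

Without OUTER => TRUE, a row whose array is empty or absent produces no output rows at all, which can silently drop customers from a report.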

🧪 Real-World Story — Arjun Queries SaaS Data

SELECT
  f.value:eventType::STRING AS event_type,
  COUNT(*) AS total_events
FROM raw_data,
LATERAL FLATTEN(input => data:events) f
GROUP BY event_type
ORDER BY total_events DESC;

Input: Rows of raw_data whose events arrays together contain millions of event records.

Small sample of one document:

{
  "events": [
    { "eventType": "login" },
    { "eventType": "purchase" },
    { "eventType": "logout" }
  ]
}

Output (example):

event_type | total_events
login      | 20000000
purchase   | 15000000
logout     | 10000000

💡 Best Practices

  1. Use VARIANT for raw storage: it accepts new and changing fields without table changes
  2. Flatten only when needed, since every array element becomes its own row
  3. Filter early to reduce scanned micro-partitions
  4. Use materialized views for repeated queries
  5. Cast extracted fields with ::TYPE so comparisons and output use the intended type
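
Practices 2, 3, and 5 can be combined in one query. This is a minimal sketch against the orders sample from the flattening section; the filter value 'John' is only an example. TRY_TO_NUMBER is used as a defensive alternative to a plain cast for fields whose type varies between documents.

```sql
SELECT
  data:customer::STRING                  AS customer,
  f.value:id::NUMBER                     AS order_id,
  TRY_TO_NUMBER(f.value:amount::STRING)  AS order_amount  -- NULL instead of an error on non-numeric values
FROM raw_data,
LATERAL FLATTEN(input => data:orders) f
WHERE data:customer::STRING = 'John';  -- filter on the parent row rather than on flattened output
```

Filtering on a field of the parent document keeps the predicate independent of the flattened rows, so only the matching documents need to contribute output.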

📘 Summary

Snowflake simplifies handling semi-structured data:

  • Store JSON, XML, and Avro directly in VARIANT
  • Query with path notation (data:customer.name), XMLGET for XML, and FLATTEN for arrays
  • Nested and dynamic fields are supported
  • No preprocessing outside Snowflake or schema migration is needed

Handling semi-structured data becomes fast, scalable, and simple.


👉 Next Topic

Flatten Function & VARIANT Type — Real Use Cases