Handling Semi-Structured Data (JSON, XML, Avro)

✨ Story Time — “How Do I Query JSON in Snowflake?”

Arjun, a data engineer, just received a dataset from a new SaaS platform.

Problem:

Data is JSON files
Nested arrays, mixed types, dynamic fields
Exported as XML for another system
Some files in Avro format

He’s used to traditional relational tables and asks:

“Do I need to flatten everything into columns? Do I need to preprocess outside Snowflake?”

The answer: No. Snowflake makes semi-structured data easy.

🧱 Step 1: Understanding Snowflake VARIANT Type

Snowflake uses VARIANT to store semi-structured data:

JSON
XML
Avro
Parquet
ORC

Example:

CREATE TABLE raw_data (
    id STRING,
    data VARIANT
);

Input: Declaring a table with two columns (id, data).

Output: An empty table raw_data is created successfully.

2️⃣ Loading Semi-Structured Data

JSON Example:

COPY INTO raw_data
FROM @my_stage/json_files/
FILE_FORMAT = (TYPE = 'JSON');

Input: JSON files staged in: @my_stage/json_files/

Output: Rows inserted into raw_data with JSON parsed as VARIANT.

XML Example:

COPY INTO raw_data
FROM @my_stage/xml_files/
FILE_FORMAT = (TYPE = 'XML');

Input: XML files staged in: @my_stage/xml_files/

Output: XML is parsed and stored in the VARIANT column.

Avro Example:

COPY INTO raw_data
FROM @my_stage/avro_files/
FILE_FORMAT = (TYPE = 'AVRO');

Input: Avro files located in: @my_stage/avro_files/

Output: Avro data is loaded and schema interpreted automatically.

3️⃣ Querying JSON / XML / Avro

JSON Example:

SELECT
    data:customer.name AS customer_name,
    data:customer.address.city AS city
FROM raw_data
WHERE data:customer.age > 30;

Input (sample JSON):

{
  "customer": {
    "name": "John",
    "age": 35,
    "address": { "city": "Bangalore" }
  }
}

Output:

customer_name	city
John	Bangalore

XML Example:

SELECT
    data:"customer"."name"::STRING AS customer_name,
    data:"customer"."address"."city"::STRING AS city
FROM raw_data;

Input (sample XML):

<customer>
  <name>Arjun</name>
  <address><city>Hyderabad</city></address>
</customer>

Output:

customer_name	city
Arjun	Hyderabad

Avro Example:

SELECT
    data:key1::STRING AS key1_value,
    data:key2::NUMBER AS key2_value
FROM raw_data;

Input: Avro record with fields key1, key2.

Output:

key1_value	key2_value
"A123"	90

4️⃣ Flattening Nested Arrays

Example JSON:

{
  "customer": "John",
  "orders": [
    { "id": 1, "amount": 100 },
    { "id": 2, "amount": 150 }
  ]
}

Query:

SELECT
    data:customer AS customer,
    f.value:id AS order_id,
    f.value:amount AS order_amount
FROM raw_data,
LATERAL FLATTEN(input => data:orders) f;

Input: Orders array with two objects.

Output:

customer	order_id	order_amount
John	1	100
John	2	150

🧪 Real-World Story — Arjun Queries SaaS Data

SELECT
    f.value:eventType AS event_type,
    COUNT(*) AS total_events
FROM raw_data,
LATERAL FLATTEN(input => data:events) f
GROUP BY event_type
ORDER BY total_events DESC;

Input: VARIANT object containing millions of event records.

Example small sample:

{
  "events": [
    { "eventType": "login" },
    { "eventType": "purchase" },
    { "eventType": "logout" }
  ]
}

Output (example):

event_type	total_events
login	20000000
purchase	15000000
logout	10000000

💡 Best Practices

Use VARIANT for raw storage — flexible and efficient
Flatten only when needed — avoid unnecessary row explosion
Filter early to reduce scanned micro-partitions
Use materialized views for repeated queries
Cast fields using ::TYPE when required

📘 Summary

Snowflake simplifies handling semi-structured data:

Store JSON, XML, Avro directly in VARIANT
Query with dot notation and FLATTEN
Nested and dynamic fields supported
No preprocessing or schema migration needed

Handling semi-structured data becomes fast, scalable, and simple.

👉 Next Topic

Flatten Function & VARIANT Type — Real Use Cases

✨ Story Time — “How Do I Query JSON in Snowflake?”​

🧱 Step 1: Understanding Snowflake VARIANT Type​

Example:​

2️⃣ Loading Semi-Structured Data​

JSON Example:​

XML Example:​

Avro Example:​

3️⃣ Querying JSON / XML / Avro​

JSON Example:​

XML Example:​

Avro Example:​

4️⃣ Flattening Nested Arrays​

🧪 Real-World Story — Arjun Queries SaaS Data​

💡 Best Practices​

📘 Summary​

👉 Next Topic

✨ Story Time — “How Do I Query JSON in Snowflake?”

🧱 Step 1: Understanding Snowflake VARIANT Type

Example:

2️⃣ Loading Semi-Structured Data

JSON Example:

XML Example:

Avro Example:

3️⃣ Querying JSON / XML / Avro

JSON Example:

XML Example:

Avro Example:

4️⃣ Flattening Nested Arrays

🧪 Real-World Story — Arjun Queries SaaS Data

💡 Best Practices

📘 Summary