Why VARIANT Outperforms STRING for JSON Data

When VARIANT was introduced in Databricks, it quickly became an excellent solution for handling JSON schema evolution challenges. However, more than a year later, I'm surprised to see many engineers still storing JSON data as simple STRING data types in their bronze layer.

When I discussed this with engineering teams, they explained that their schemas are stable and they don't need VARIANT's flexibility for schema evolution. This conversation inspired me to benchmark the additional benefits that VARIANT offers beyond schema flexibility, specifically in terms of storage efficiency and query performance.

The Methodology

I created two identical tables with 10 million rows each: one storing JSON as VARIANT and another as STRING. The results? VARIANT consumed 22% less storage space than STRING, as shown in the comparison below:

Here’s the script, which I used to create the test tables. For the VARIANT table you just need to add parse_json() function:

-- Synthetic dataset (as JSON string) in a temp view
CREATE OR REPLACE TEMP VIEW v_payload_json AS
SELECT
  to_json(
    named_struct(
      'user',
      named_struct(
        'id',
        concat('u', cast(rand(3) * 10000000 as int)),
        'country',
        element_at(
          array('US', 'PL', 'IT', 'DE', 'KR', 'JP', 'BR', 'IN', 'GB', 'FR'),
          cast(rand(2) * 10 as int) + 1
        ),
        'age',
        cast(rand(4) * 70 as int)
      ),
      'metrics',
      named_struct('clicks', cast(rand(5) * 100 as int), 'spent', round(rand(6) * 1000, 2)),
      'flags',
      named_struct('is_paid', rand(7) < 0.35, 'is_new', rand(8) < 0.5),
      'items',
      transform(
        sequence(1, greatest(1, cast(rand(9) * 5 as int))),
        i -> named_struct(
          'sku',
          concat('sku_', id, '_', i),
          'price',
          round(rand(10) * 200, 2),
          'qty',
          cast(rand(11) * 5 as int)
        )
      ),
      'extra',
      map(
        'ua',
        element_at(array('Chrome', 'Safari', 'Firefox', 'Edge'), cast(rand(12) * 4 as int) + 1),
        'campaign',
        element_at(array('A', 'B', 'C', 'D'), cast(rand(13) * 4 as int) + 1)
      )
    )
  ) AS payload_json
FROM
  range(10000000); -- scale factor
  
  -- JSON string table (one column)
CREATE TABLE json_str AS
SELECT payload_json AS payload
FROM v_payload_json;

-- VARIANT table (one column) — same content, parsed once
CREATE TABLE json_variant AS
SELECT parse_json(payload_json) AS payload
FROM v_payload_json;

The time to create both tables was identical. The cache was disabled for both tests (SET use_cached_result = false;) and I repeated each test multiple times on a SQL Warehouse to ensure consistent results.

Performance Test Results

Test 1 - Filter on a nested value

SELECT count(*) AS users_25_35
FROM json_str|json_variant
WHERE payload:user.age::int BETWEEN 25 AND 35;

Result: VARIANT was 2x faster for filtering operations on nested JSON values.

Test 2 - Group by nested string (country), and aggregate nested int (clicks)

CREATE OR REPLACE TABLE agg_country_clicks AS
SELECT payload:user.country::string AS country,
       SUM(payload:metrics.clicks::int) AS clicks
FROM json_str|json_variant
GROUP BY country
ORDER BY clicks DESC;

Result: VARIANT performed 50% faster for complex aggregation queries involving nested data.

Test 3 - Top‑N query by a nested numeric value

CREATE OR REPLACE TABLE top_users_spent AS
SELECT payload:user.id::string AS user_id,
       payload:metrics.spent::double AS spent
FROM json_str|json_variant
ORDER BY spent DESC;

Result: VARIANT showed 44% better performance for sorting operations on nested numeric values.

My Findings

The benchmark results show that VARIANT consistently outperforms STRING for JSON operations, delivering up to 50% faster query performance without requiring any code changes. Here's the complete performance summary:

  • Storage: 22% more efficient storage consumption

  • Filtering: 2x faster for nested value filters

  • Aggregation: 50% faster for complex group-by operations

  • Sorting: 44% faster for top-N queries on nested data

Conclusion

Even if your JSON schema is stable and you don't need VARIANT's schema evolution capabilities, the storage and performance benefits alone make it a compelling choice for any JSON workload in Databricks.

So, to all data engineering teams working with JSON data in Databricks: VARIANT is not just for handling schema changes, it can also optimize your entire data pipeline for better performance and lower storage costs.

Hubert Dudek

Databricks MVP | Advisor to Databricks Product Board and Technical advisor to SunnyData

https://www.linkedin.com/in/hubertdudek/
Next
Next

The Hidden Benefits of Databricks' Managed Tables