How is variant different than JSON strings?

Important

This feature is in Public Preview.

This article describes the behavior changes and differences in syntax and semantics when working with the variant data type. This article assumes that you are familiar with working with JSON string data on Azure Databricks. For users new to Azure Databricks, you should use variant over JSON strings whenever storing semi-structured data that requires flexibility for changing or unknown schema. See Model semi-structured data.

In Databricks Runtime 15.3 and above, you can use the variant data type to encode and query semi-structured data. Databricks recommends variant as a replacement for storing semi-structured data using JSON strings. The improved read and write performance for variant allows it to replace native Spark complex types such as structs and arrays in some use cases.

How do you query variant data?

Variant data uses the same operators to query fields, subfields, and array elements.

To query a field, use :. For example, column_name:field_name.

To query a subfield, use .. For example, column_name:field_name.subfield_name.

To query an array element, use [n] where n is the integer index value of the element. For example, to query the first value in an array, column_name:array_name[0].

The following differences might break existing queries when upgrading from JSON strings to variant:

  • All variant path elements are matched in a case-sensitive way. JSON strings are case-insensitive. This means that for variant, column_name:FIELD_NAME and column_name:field_name look for different fields in the stored data.
  • The [*] syntax is not support for identifying or unpacking all elements in an array.
  • Variant encodes NULL values differently than JSON strings. See Variant null rules.
  • Variant columns have limitations for some operations. See Limitations.

Convert JSON strings to and from variant

In Databricks Runtime 15.3 and above, the to_json function has additional functionality to cast VARIANT types to JSON strings. Options are ignored when converting VARIANT to JSON string. See to_json.

The parse_json function transforms a JSON string to VARIANT type. While parse_json(json_string_column) is the logical inverse of to_json(variant_column), the following conversion rules describe why it is not the exact inverse:

  • Whitespace is not perfectly preserved.
  • Ordering of keys is arbitrary.
  • Trailing zeros in numbers might be truncated.

The parse_json function returns an error if the JSON string is malformed or exceeds the variant size limit. Use the try_parse_json function to instead return a NULL when an error in parsing occurs.

What are the SQL functions for working with variants?

Apache Spark SQL functions available in Databricks Runtime 15.3 and above provide methods for interacting with variant data. The following table includes the new function, the corresponding JSON string function, and notes on differences in behavior.

Note

To use these functions with PySpark DataFrames, import them from pyspark.sql.functions. variant_explode and variant_explode_outer are not supported in PySpark.

Variant function JSON string function Notes
variant_get cast and get_json_object Takes an expression, path, and type. Follows all rules for variants paths, casting, and nulls.
try_variant_get try_cast and get_json_object Takes an expression, path, and type. Follows all rules for variants paths, casting, and nulls.
is_variant_null is null Checks whether the expression is storing a VARIANT encoded NULL. Use is null to check if the input expression is NULL.
schema_of_variant schema_of_json When determining the schema for an ARRAY<elementType>, the elementType might be inferred as VARIANT if there are conflicting types found in the data.
schema_of_variant_agg schema_of_json_agg When no least common type is identified, the type is derived as VARIANT.
variant_explode explode Outputs pos, key, and value columns. When exploding an array, the output key is always null.
variant_explode_outer explode_outer Outputs pos, key, and value columns. When exploding an array, the output key is always null.

Variants handle casting and NULLs differently than JSON strings. See Variant type casting rules and Variant null rules.