Model semi-structured data

This article recommends patterns for storing semi-structured data depending on how your organization uses the data. Azure Databricks provides functions, native data types, and query syntax to work with semi-structured, nested, and complex data.

The following considerations impact which pattern you should use:

  • Do the fields or types in the data source change frequently?
  • How many total unique fields are contained in the data source?
  • Do you need to optimize your workloads for writes or reads?

Databricks recommends storing data as Delta tables for downstream queries.

Use variant

In Databricks Runtime 15.3 and above, you can use the VARIANT type to store semi-structured JSON data using an optimized encoding that outperforms JSON strings for reads and writes.

The VARIANT type has similar applications to JSON strings. Some workloads still benefit from structs, maps, and arrays, especially for data with well-known schemas that gain from optimized data layout and statistics collection.
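A minimal sketch of this pattern is shown below. It assumes Databricks Runtime 15.3 or above; the raw_events table and payload column are hypothetical names used for illustration.

    -- Create a table with a VARIANT column and ingest a JSON record.
    CREATE TABLE raw_events (payload VARIANT);

    INSERT INTO raw_events
      SELECT PARSE_JSON('{"device_id": 12, "reading": {"temp": 21.5}}');

    -- Extract nested fields with : notation and cast them to target types.
    SELECT
      payload:device_id::INT AS device_id,
      payload:reading.temp::DOUBLE AS temp
    FROM raw_events;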

For more details, see the following articles:

Use JSON strings

You can store data in a single string column using standard JSON formatting and then query fields in the JSON using : notation.

Many systems output records as string or byte-encoded JSON records. Ingesting and storing these records as strings has very low processing overhead. You can also use the to_json function to turn any struct of data into a JSON string.
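As a minimal sketch of this pattern (the raw_json_events table and its field names are hypothetical), the following stores raw JSON as a string column, queries it with : notation, and uses to_json to produce a JSON string from a struct.

    -- Store each record as a JSON-formatted string.
    CREATE TABLE raw_json_events (value STRING);

    INSERT INTO raw_json_events VALUES
      ('{"device_id": 12, "reading": {"temp": 21.5}}');

    -- Query fields in the string with : notation; extracted values are strings.
    SELECT
      value:device_id,
      value:reading.temp
    FROM raw_json_events;

    -- to_json turns a struct into a JSON string.
    SELECT to_json(named_struct('device_id', 12, 'temp', 21.5)) AS json_value;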

Consider the following strengths and weaknesses when choosing to store data as JSON strings:

  • All values are stored as strings without type information.
  • JSON supports all data types that can be represented using text.
  • JSON supports strings of arbitrary length.
  • There are no limits on the number of fields that can be represented in a single JSON data column.
  • Data requires no pre-processing before writing to the table.
  • You can resolve type issues present in the data in downstream workloads.
  • JSON provides the worst performance on read, as you must parse the entire string for every query.

JSON strings provide great flexibility and an easy-to-implement solution for getting raw data into a lakehouse table. You might choose to use JSON strings for many applications, but they are especially useful when the most important outcome of a workload is storing a complete and accurate representation of a data source for downstream processing. Some use cases might include:

  • Ingesting streaming data from a queue service such as Kafka.
  • Recording responses from REST API queries.
  • Storing raw records from an upstream data source not controlled by your team.

Assuming your ingestion logic is flexible, storing data as a JSON string should be resilient even if you encounter new fields, changes in data structure, or type changes in the data source. While downstream workloads might fail due to these changes, your table contains a full history of the source data, meaning that you can remediate issues without needing to go back to the data source.

Use structs

You can store semi-structured data with structs and enable all native functionality of columns while maintaining the nested structure of the data source.

Delta Lake treats data stored as structs the same as any other columns, meaning that there is no functional difference between struct fields and other columns. The Parquet data files used by Delta Lake create a column for each field in a struct. You can use struct fields as clustering columns or partitioning columns, and you can collect statistics on struct fields for data skipping.
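For illustration, the following sketch (with hypothetical table and field names) defines a struct column and queries its fields with dot notation.

    -- Each struct field is stored as its own column in the underlying Parquet files.
    CREATE TABLE device_readings (
      device_id INT,
      reading STRUCT<temp: DOUBLE, humidity: DOUBLE>
    );

    INSERT INTO device_readings VALUES
      (12, named_struct('temp', 21.5, 'humidity', 0.44));

    -- Struct fields are referenced with dot notation and support data skipping.
    SELECT device_id, reading.temp
    FROM device_readings
    WHERE reading.temp > 20.0;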

Structs generally provide the best performance on read, as they support all data skipping optimizations and store individual fields as columns. Performance can begin to suffer when the number of columns present gets into the hundreds.

Each field in a struct has a data type, which is enforced on write the same as columns. As such, structs require full pre-processing of data. This can be beneficial when you only want validated data committed to a table, but can lead to dropped data or failing jobs when processing malformed records from upstream systems.

Structs are less flexible than JSON strings for schema evolution, whether that means evolving data types or adding new fields.

Use maps and arrays

You can use a combination of maps and arrays to replicate semi-structured data formats natively in Delta Lake. Statistics cannot be collected on fields defined with these types, but they provide balanced performance on both read and write for semi-structured datasets that have around 500 fields.

Both the key and value of maps are typed, so data is pre-processed and schema is enforced on write.

To accelerate queries, Databricks recommends storing fields that are often used to filter data as separate columns.
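The following sketch (hypothetical table, key, and field names) shows a map column and an array column, with a frequently filtered field such as device_id kept as a separate top-level column.

    -- Map keys and values are typed; the schema is enforced on write.
    CREATE TABLE device_attributes (
      device_id INT,
      attributes MAP<STRING, STRING>,
      tags ARRAY<STRING>
    );

    INSERT INTO device_attributes VALUES
      (12, map('region', 'us-west', 'firmware', '1.4.2'), array('prod', 'edge'));

    -- Access map values by key and array elements by (0-based) index.
    SELECT
      device_id,
      attributes['region'] AS region,
      tags[0] AS first_tag
    FROM device_attributes;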

Do I need to flatten my data?

If you are storing your data using JSON or maps, consider storing fields frequently used for filtering queries as columns. Stats collection, partitioning, and clustering are not available for fields within JSON strings or maps. You do not need to do this for data stored as structs.
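As a sketch of this approach (reusing the hypothetical raw_json_events table from above), the following promotes a frequently filtered field out of the JSON string into its own column while keeping the original string intact.

    CREATE TABLE events_flattened AS
    SELECT
      value:device_id::INT AS device_id,  -- flattened column usable for stats, clustering, and partitioning
      value                               -- original JSON string kept for downstream processing
    FROM raw_json_events;

    SELECT * FROM events_flattened WHERE device_id = 12;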

Syntax for working with nested data

Review the following resources for information on working with nested data: