Rediger

Del via


Query nested types in Parquet and JSON files by using serverless SQL pool in Azure Synapse Analytics

In this article, you'll learn how to write a query by using serverless SQL pool in Azure Synapse Analytics. The query will read Parquet nested types. Nested types are complex structures that represent objects or arrays. Nested types can be stored in:

  • Parquet, where you can have multiple complex columns that contain arrays and objects.
  • Hierarchical JSON files, where you can read a complex JSON document as a single column.
  • Azure Cosmos DB collections (currently under gated public preview), where every document can contain complex nested properties.

Serverless SQL pool formats all nested types as JSON objects and arrays. So you can extract or modify complex objects by using JSON functions or parse JSON data by using the OPENJSON function.

Here's an example of a query that extracts scalar and object values from the COVID-19 Open Research Dataset JSON file, which contains nested objects:

SELECT
    title = JSON_VALUE(doc, '$.metadata.title'),
    first_author = JSON_QUERY(doc, '$.metadata.authors[0]'),
    first_author_name = JSON_VALUE(doc, '$.metadata.authors[0].first'),
    complex_object = doc
FROM
    OPENROWSET(
        BULK 'https://azureopendatastorage.blob.core.windows.net/covid19temp/comm_use_subset/pdf_json/000b7d1517ceebb34e1e3e817695b6de03e2fa78.json',
        FORMAT='CSV', FIELDTERMINATOR ='0x0b', FIELDQUOTE = '0x0b', ROWTERMINATOR = '0x0b'
    )
    WITH ( doc varchar(MAX) ) AS docs;

The JSON_VALUE function returns a scalar value from the field at the specified path. The JSON_QUERY function returns an object formatted as JSON from the field at the specified path.

Important

This example uses a file from the COVID-19 Open Research Dataset. See the license and the structure of the data here.

Prerequisites

The first step is to create a database where the datasource will be created. You'll then initialize the objects by running a setup script on the database. The setup script will create the data sources, database-scoped credentials, and external file formats that are used in the samples.

Project nested or repeated data

A Parquet file can have multiple columns with complex types. The values from these columns are formatted as JSON text and returned as VARCHAR columns. The following query reads the structExample.parquet file and shows how to read the values of the nested columns:

SELECT
    DateStruct, TimeStruct, TimestampStruct, DecimalStruct, FloatStruct
FROM
    OPENROWSET(
        BULK 'parquet/nested/structExample.parquet',
        DATA_SOURCE = 'SqlOnDemandDemo',
        FORMAT='PARQUET'
    )
    WITH (
        DateStruct VARCHAR(8000),
        TimeStruct VARCHAR(8000),
        TimestampStruct VARCHAR(8000),
        DecimalStruct VARCHAR(8000),
        FloatStruct VARCHAR(8000)
    ) AS [r];

This query returns the following result. The content of every nested object is returned as JSON text.

DateStruct TimeStruct TimestampStruct DecimalStruct FloatStruct
{"Date":"2009-04-25"} {"Time":"20:51:54.3598000"} {"Timestamp":"5501-04-08 12:13:57.4821000"} {"Decimal":11143412.25350} {"Float":0.5}
{"Date":"1916-04-29"} {"Time":"00:16:04.6778000"} {"Timestamp":"1990-06-30 20:50:52.6828000"} {"Decimal":1963545.62800} {"Float":-2.125}

The following query reads the justSimpleArray.parquet file. It projects all columns from the Parquet file, including nested and repeated data.

SELECT
    SimpleArray
FROM
    OPENROWSET(
        BULK 'parquet/nested/justSimpleArray.parquet',
        DATA_SOURCE = 'SqlOnDemandDemo',
        FORMAT='PARQUET'
    ) AS [r];

This query will return the following result:

SimpleArray
[11,12,13]
[21,22,23]

Read properties from nested object columns

The JSON_VALUE function enables you to return values from columns formatted as JSON text:

SELECT
    title = JSON_VALUE(complex_column, '$.metadata.title'),
    first_author_name = JSON_VALUE(complex_column, '$.metadata.authors[0].first'),
    body_text = JSON_VALUE(complex_column, '$.body_text.text'),
    complex_column
FROM
    OPENROWSET( BULK 'https://azureopendatastorage.blob.core.windows.net/covid19temp/comm_use_subset/pdf_json/000b7d1517ceebb34e1e3e817695b6de03e2fa78.json',
                FORMAT='CSV', FIELDTERMINATOR ='0x0b', FIELDQUOTE = '0x0b', ROWTERMINATOR = '0x0b' ) WITH ( complex_column varchar(MAX) ) AS docs;

The result is shown in the following table:

title first_author_name body_text complex_column
Supplementary Information An eco-epidemiolo... Julien - Figure S1 : Phylogeny of... { "paper_id": "000b7d1517ceebb34e1e3e817695b6de03e2fa78", "metadata": { "title": "Supplementary Information An eco-epidemiological study of Morbilli-related paramyxovirus infection in Madagascar bats reveals host-switching as the dominant macro-evolutionary mechanism", "authors": [ { "first": "Julien"

Unlike JSON files, which in most cases return a single column that contains a complex JSON object, Parquet files can have multiple complex columns. You can read the properties of nested columns by using the JSON_VALUE function on each column. OPENROWSET enables you to directly specify the paths of the nested properties in a WITH clause. You can set the paths as the name of a column, or you can add a JSON path expression after the column type.

The following query reads the structExample.parquet file and shows how to surface elements of a nested column. There are two ways to reference a nested value:

  • By specifying the nested value path expression after the type specification.
  • By formatting the column name as a nested path by using do "." to reference the fields.
SELECT
    *
FROM
    OPENROWSET(
        BULK 'parquet/nested/structExample.parquet',
        DATA_SOURCE = 'SqlOnDemandDemo',
        FORMAT='PARQUET'
    )
    WITH (
        [DateValue] DATE '$.DateStruct.Date',
        [TimeStruct.Time] TIME,
        [TimestampStruct.Timestamp] DATETIME2,
        DecimalValue DECIMAL(18, 5) '$.DecimalStruct.Decimal',
        [FloatStruct.Float] FLOAT
    ) AS [r];

Access elements from repeated columns

The following query reads the justSimpleArray.parquet file and uses JSON_VALUE to retrieve a scalar element from within a repeated column, like an array or map:

SELECT
    *,
    JSON_VALUE(SimpleArray, '$[0]') AS FirstElement,
    JSON_VALUE(SimpleArray, '$[1]') AS SecondElement,
    JSON_VALUE(SimpleArray, '$[2]') AS ThirdElement
FROM
    OPENROWSET(
        BULK 'parquet/nested/justSimpleArray.parquet',
        DATA_SOURCE = 'SqlOnDemandDemo',
        FORMAT='PARQUET'
    ) AS [r];

Here's the result:

SimpleArray FirstElement SecondElement ThirdElement
[11,12,13] 11 12 13
[21,22,23] 21 22 23

Access sub-objects from complex columns

The following query reads the mapExample.parquet file and uses JSON_QUERY to retrieve a non-scalar element from within a repeated column, like an array or map:

SELECT
    MapOfPersons,
    JSON_QUERY(MapOfPersons, '$."John Doe"') AS [John]
FROM
    OPENROWSET(
        BULK 'parquet/nested/mapExample.parquet',
        DATA_SOURCE = 'SqlOnDemandDemo',
        FORMAT='PARQUET'
    ) AS [r];

You can also explicitly reference the columns that you want to return in a WITH clause:

SELECT DocId,
    MapOfPersons,
    JSON_QUERY(MapOfPersons, '$."John Doe"') AS [John]
FROM
    OPENROWSET(
        BULK 'parquet/nested/mapExample.parquet',
        DATA_SOURCE = 'SqlOnDemandDemo',
        FORMAT='PARQUET'
    ) 
    WITH (DocId bigint, MapOfPersons VARCHAR(max)) AS [r];

The structure MapOfPersons is returned as a VARCHAR column and formatted as a JSON string.

Project values from repeated columns

If you have an array of scalar values (for example [1,2,3]) in some columns, you can easily expand them and join them with the main row by using this script:

SELECT
    SimpleArray, Element
FROM
    OPENROWSET(
        BULK 'parquet/nested/justSimpleArray.parquet',
        DATA_SOURCE = 'SqlOnDemandDemo',
        FORMAT='PARQUET'
    ) AS arrays
    CROSS APPLY OPENJSON (SimpleArray) WITH (Element int '$') as array_values

Next steps

The next article will show you how to Query JSON files.