Muokkaa

Jaa


Indexing blobs and files to produce multiple search documents

Applies to: Blob indexers, File indexers

By default, an indexer treats the contents of a blob or file as a single search document. If you want a more granular representation in a search index, you can set parsingMode values to create multiple search documents from one blob or file. The parsingMode values that result in many search documents include delimitedText (for CSV), and jsonArray or jsonLines (for JSON).

When you use any of these parsing modes, the new search documents that emerge must have unique document keys, and a problem arises in determining where that value comes from. The parent blob has at least one unique value in the form of metadata_storage_path property, but if it contributes that value to more than one search document, the key is no longer unique in the index.

To address this problem, the blob indexer generates an AzureSearch_DocumentKey that uniquely identifies each child search document created from the single blob parent. This article explains how this feature works.

One-to-many document key

Each document in an index is uniquely identified by a document key. When no parsing mode is specified, and if there's no explicit field mapping in the indexer definition for the search document key, the blob indexer automatically maps the metadata_storage_path property as the document key. This default mapping ensures that each blob appears as a distinct search document, and it saves you the step of having to create this field mapping yourself (normally, only fields having identical names and types are automatically mapped).

In a one-to-many search document scenario, an implicit document key based on metadata_storage_path property isn't possible. As a workaround, Azure AI Search can generate a document key for each individual entity extracted from a blob. The generated key is named AzureSearch_DocumentKey and it's added to each search document. The indexer keeps track of the "many documents" created from each blob, and can target updates to the search index when source data changes over time.

By default, when no explicit field mappings for the key index field are specified, the AzureSearch_DocumentKey is mapped to it, using the base64Encode field-mapping function.

Example

Assume an index definition with the following fields:

  • id
  • temperature
  • pressure
  • timestamp

And your blob container has blobs with the following structure:

Blob1.json

{ "temperature": 100, "pressure": 100, "timestamp": "2024-02-13T00:00:00Z" }
{ "temperature" : 33, "pressure" : 30, "timestamp": "2024-02-14T00:00:00Z" }

Blob2.json

{ "temperature": 1, "pressure": 1, "timestamp": "2023-01-12T00:00:00Z" }
{ "temperature" : 120, "pressure" : 3, "timestamp": "2022-05-11T00:00:00Z" }

When you create an indexer and set the parsingMode to jsonLines - without specifying any explicit field mappings for the key field, the following mapping is applied implicitly.

{
    "sourceFieldName" : "AzureSearch_DocumentKey",
    "targetFieldName": "id",
    "mappingFunction": { "name" : "base64Encode" }
}

This setup results in disambiguated document keys, similar to the following illustration (base64-encoded ID shortened for brevity).

ID temperature pressure timestamp
aHR0 ... YjEuanNvbjsx 100 100 2024-02-13T00:00:00Z
aHR0 ... YjEuanNvbjsy 33 30 2024-02-14T00:00:00Z
aHR0 ... YjIuanNvbjsx 1 1 2023-01-12T00:00:00Z
aHR0 ... YjIuanNvbjsy 120 3 2022-05-11T00:00:00Z

Custom field mapping for index key field

Assuming the same index definition as the previous example, suppose your blob container has blobs with the following structure:

Blob1.json

recordid, temperature, pressure, timestamp
1, 100, 100,"2024-02-13T00:00:00Z" 
2, 33, 30,"2024-02-14T00:00:00Z" 

Blob2.json

recordid, temperature, pressure, timestamp
1, 1, 1,"20123-01-12T00:00:00Z" 
2, 120, 3,"2022-05-11T00:00:00Z" 

When you create an indexer with delimitedText parsingMode, it might feel natural to set up a field-mapping function to the key field as follows:

{
    "sourceFieldName" : "recordid",
    "targetFieldName": "id"
}

However, this mapping won't result in four documents showing up in the index because the recordid field isn't unique across blobs. Hence, we recommend you to make use of the implicit field mapping applied from the AzureSearch_DocumentKey property to the key index field for "one-to-many" parsing modes.

If you do want to set up an explicit field mapping, make sure that the sourceField is distinct for each individual entity across all blobs.

Note

The approach used by AzureSearch_DocumentKey of ensuring uniqueness per extracted entity is subject to change and therefore you should not rely on it's value for your application's needs.

Specify the index key field in your data

Assuming the same index definition as the previous example and parsingMode is set to jsonLines without specifying any explicit field mappings so the mappings look like in the first example, suppose your blob container has blobs with the following structure:

Blob1.json

id, temperature, pressure, timestamp
1, 100, 100,"2024-02-13T00:00:00Z" 
2, 33, 30,"2024-02-14T00:00:00Z"

Blob2.json

id, temperature, pressure, timestamp
1, 1, 1,"2023-01-12T00:00:00Z" 
2, 120, 3,"2022-05-11T00:00:00Z" 

Notice that each document contains the id field, which is defined as the key field in the index. In such a case, even though a document-unique AzureSearch_DocumentKey will be generated, it won't be used as the "key" for the document. Rather, the value of the id field will be mapped to the key field

Similar to the previous example, this mapping won't result in four documents showing up in the index because the id field isn't unique across blobs. When this is the case, any json entry that specifies an id will result in a merge on the existing document instead of an upload of a new document, and the state of the index will reflect the latest read entry with the specified id.

Next steps

If you aren't already familiar with the basic structure and workflow of blob indexing, you should review Indexing Azure Blob Storage with Azure AI Search first. For more information about parsing modes for different blob content types, review the following articles.