Επεξεργασία

Κοινή χρήση μέσω


Index data from Azure Cosmos DB for NoSQL for queries in Azure AI Search

In this article, learn how to configure an indexer that imports content from Azure Cosmos DB for NoSQL and makes it searchable in Azure AI Search.

This article supplements Create an indexer with information that's specific to Cosmos DB. It uses the Azure portal and REST APIs to demonstrate a three-part workflow common to all indexers: create a data source, create an index, create an indexer. Data extraction occurs when you submit the Create Indexer request.

Because terminology can be confusing, it's worth noting that Azure Cosmos DB indexing and Azure AI Search indexing are different operations. Indexing in Azure AI Search creates and loads a search index on your search service.

Prerequisites

To work through the examples in this article, you need the Azure portal or a REST client. If you're using Azure portal, make sure that access to all public networks is enabled. Other approaches for creating a Cosmos DB indexer include Azure SDKs.

Try with sample data

Use these instructions to create a container and database in Cosmos DB for testing purposes.

  1. Download HotelsData_toCosmosDB.JSON from GitHub to create a container in Cosmos DB that contains a subset of the sample hotels data set.

  2. Sign in to the Azure portal and create an account, database, and container on Cosmos DB.

  3. In Cosmos DB, select Data Explorer for the new container, provide the following values.

    Property Value
    Database Create new
    Database ID hotelsdb
    Share throughput across containers Don't select
    Container ID hotels
    Partition key /HotelId
    Container throughput (autoscale) Autoscale
    Container Max RU/s 1000
  4. In Data Explorer, expand hotelsdb and *hotels", and then select Items.

  5. Select Upload Item and then select HotelsData_toCosmosDB.JSON file that you downloaded from GitHub.

  6. Right-click Items and select New SQL query. The default query is SELECT * FROM c.

  7. Select Execute query to run the query and view results. You should have 50 hotel documents.

Now that you have a container, you can use the Azure portal, REST client, or an Azure SDK to index your data.

The Description field provides the most verbose content. You should target this field for full text search and optional vector queries.

Use the Azure portal

You can use either the Import data wizard or Import and vectorize data wizard to automate indexing from an SQL database table or view. The data source configuration is similar for both wizards.

  1. Start the wizard.

  2. On Connect to your data, select or verify that the data source type is either Azure Cosmos DB or a NoSQL account.

    The data source name refers to the data source connection object in Azure AI Search. If you use the vector wizard, your data source name is autogenerated using a custom prefix specified at the end of the wizard workflow.

  3. Specify the database name and collection. The query is optional. It's useful if you have hierarchical data and you want to import a specific slice.

  4. Specify an authentication method, either a managed identity or built-in API key. If you don't specify a managed identity connection, the Azure portal uses the key.

    If you configure Azure AI Search to use a managed identity, and you create a role assignment on Cosmos DB that grants Cosmos DB Account Reader and Cosmos DB Built-in Data Reader permissions to the identity, your indexer can connect to Cosmos DB using Microsoft Entra ID and roles.

  5. For the Import and vectorize data wizard, you can specify options for change and deletion tracking.

    Change detection is supported by default through a _ts field (timestamp). If you upload content using the approach described in Try with sample data, the collection is created with a _ts field.

    Deletion detection requires that you have a pre-existing top-level field in the collection that can be used as a soft-delete flag. It should be a Boolean field (you could name it IsDeleted). Specify true as the soft-delete value. In the search index, add a corresponding search field called IsDeleted set to retrievable and filterable.

  6. Continue with the remaining steps to complete the wizard:

Use the REST APIs

This section demonstrates the REST API calls that create a data source, index, and indexer.

Define the data source

The data source definition specifies the data to index, credentials, and policies for identifying changes in the data. A data source is an independent resource that can be used by multiple indexers.

  1. Create or update a data source to set its definition:

    POST https://[service name].search.windows.net/datasources?api-version=2024-07-01
    Content-Type: application/json
    api-key: [Search service admin key]
    {
        "name": "[my-cosmosdb-ds]",
        "type": "cosmosdb",
        "credentials": {
          "connectionString": "AccountEndpoint=https://[cosmos-account-name].documents.azure.com;AccountKey=[cosmos-account-key];Database=[cosmos-database-name]"
        },
        "container": {
          "name": "[my-cosmos-db-collection]",
          "query": null
        },
        "dataChangeDetectionPolicy": {
          "@odata.type": "#Microsoft.Azure.Search.HighWaterMarkChangeDetectionPolicy",
        "  highWaterMarkColumnName": "_ts"
        },
        "dataDeletionDetectionPolicy": null,
        "encryptionKey": null,
        "identity": null
    }
    
  2. Set "type" to "cosmosdb" (required). If you're using an older Search API version 2017-11-11, the syntax for "type" is "documentdb". Otherwise, for 2019-05-06 and later, use "cosmosdb".

  3. Set "credentials" to a connection string. The next section describes the supported formats.

  4. Set "container" to the collection. The "name" property is required and it specifies the ID of the database collection to be indexed. The "query" property is optional. Use it to flatten an arbitrary JSON document into a flat schema that Azure AI Search can index.

  5. Set "dataChangeDetectionPolicy" if data is volatile and you want the indexer to pick up just the new and updated items on subsequent runs.

  6. Set "dataDeletionDetectionPolicy" if you want to remove search documents from a search index when the source item is deleted.

Supported credentials and connection strings

Indexers can connect to a collection using the following connections.

Avoid port numbers in the endpoint URL. If you include the port number, the connection fails.

Full access connection string
{ "connectionString" : "AccountEndpoint=https://<Cosmos DB account name>.documents.azure.com;AccountKey=<Cosmos DB auth key>;Database=<Cosmos DB database id>" }`
You can get the connection string from the Azure Cosmos DB account page in the Azure portal by selecting Keys in the left navigation pane. Make sure to select a full connection string and not just a key.
Managed identity connection string
{ "connectionString" : "ResourceId=/subscriptions/<your subscription ID>/resourceGroups/<your resource group name>/providers/Microsoft.DocumentDB/databaseAccounts/<your cosmos db account name>/;(ApiKind=[api-kind];)/(IdentityAuthType=[identity-auth-type])" }
This connection string doesn't require an account key, but you must have a search service that can connect using a managed identity. For connections targeting the SQL API, you can omit ApiKind from the connection string. For more information about ApiKind, IdentityAuthType see Setting up an indexer connection to an Azure Cosmos DB database using a managed identity.

Using queries to shape indexed data

In the "query" property under "container", you can specify a SQL query to flatten nested properties or arrays, project JSON properties, and filter the data to be indexed.

Example document:

    {
        "userId": 10001,
        "contact": {
            "firstName": "andy",
            "lastName": "hoh"
        },
        "company": "microsoft",
        "tags": ["azure", "cosmosdb", "search"]
    }

Filter query:

SELECT * FROM c WHERE c.company = "microsoft" and c._ts >= @HighWaterMark ORDER BY c._ts

Flattening query:

SELECT c.id, c.userId, c.contact.firstName, c.contact.lastName, c.company, c._ts FROM c WHERE c._ts >= @HighWaterMark ORDER BY c._ts

Projection query:

SELECT VALUE { "id":c.id, "Name":c.contact.firstName, "Company":c.company, "_ts":c._ts } FROM c WHERE c._ts >= @HighWaterMark ORDER BY c._ts

Array flattening query:

SELECT c.id, c.userId, tag, c._ts FROM c JOIN tag IN c.tags WHERE c._ts >= @HighWaterMark ORDER BY c._ts

Unsupported queries (DISTINCT and GROUP BY)

Queries using the DISTINCT keyword or GROUP BY clause aren't supported. Azure AI Search relies on SQL query pagination to fully enumerate the results of the query. Neither the DISTINCT keyword or GROUP BY clauses are compatible with the continuation tokens used to paginate results.

Examples of unsupported queries:

SELECT DISTINCT c.id, c.userId, c._ts FROM c WHERE c._ts >= @HighWaterMark ORDER BY c._ts

SELECT DISTINCT VALUE c.name FROM c ORDER BY c.name

SELECT TOP 4 COUNT(1) AS foodGroupCount, f.foodGroup FROM Food f GROUP BY f.foodGroup

Although Azure Cosmos DB has a workaround to support SQL query pagination with the DISTINCT keyword by using the ORDER BY clause, it isn't compatible with Azure AI Search. The query returns a single JSON value, whereas Azure AI Search expects a JSON object.

-- The following query returns a single JSON value and isn't supported by Azure AI Search
SELECT DISTINCT VALUE c.name FROM c ORDER BY c.name

Add search fields to an index

In a search index, add fields to accept the source JSON documents or the output of your custom query projection. Ensure that the search index schema is compatible with source data. For content in Azure Cosmos DB, your search index schema should correspond to the Azure Cosmos DB items in your data source.

  1. Create or update an index to define search fields that store data:

    POST https://[service name].search.windows.net/indexes?api-version=2024-07-01
    Content-Type: application/json
    api-key: [Search service admin key]
    {
        "name": "mysearchindex",
        "fields": [{
            "name": "rid",
            "type": "Edm.String",
            "key": true,
            "searchable": false
        }, 
        {
            "name": "description",
            "type": "Edm.String",
            "filterable": false,
            "searchable": true,
            "sortable": false,
            "facetable": false,
            "suggestions": true
        }
      ]
    }
    
  2. Create a document key field ("key": true). For partitioned collections, the default document key is the Azure Cosmos DB _rid property, which Azure AI Search automatically renames to rid because field names can’t start with an underscore character. Also, Azure Cosmos DB _rid values contain characters that are invalid in Azure AI Search keys. For this reason, the _rid values are Base64 encoded.

  3. Create more fields for more searchable content. See Create an index for details.

Mapping data types

JSON data types Azure AI Search field types
Bool Edm.Boolean, Edm.String
Numbers that look like integers Edm.Int32, Edm.Int64, Edm.String
Numbers that look like floating-points Edm.Double, Edm.String
String Edm.String
Arrays of primitive types such as ["a", "b", "c"] Collection(Edm.String)
Strings that look like dates Edm.DateTimeOffset, Edm.String
GeoJSON objects such as { "type": "Point", "coordinates": [long, lat] } Edm.GeographyPoint
Other JSON objects N/A

Configure and run the Azure Cosmos DB for NoSQL indexer

Once the index and data source have been created, you're ready to create the indexer. Indexer configuration specifies the inputs, parameters, and properties controlling run time behaviors.

  1. Create or update an indexer by giving it a name and referencing the data source and target index:

    POST https://[service name].search.windows.net/indexers?api-version=2024-07-01
    Content-Type: application/json
    api-key: [search service admin key]
    {
        "name" : "[my-cosmosdb-indexer]",
        "dataSourceName" : "[my-cosmosdb-ds]",
        "targetIndexName" : "[my-search-index]",
        "disabled": null,
        "schedule": null,
        "parameters": {
            "batchSize": null,
            "maxFailedItems": 0,
            "maxFailedItemsPerBatch": 0,
            "base64EncodeKeys": false,
            "configuration": {}
            },
        "fieldMappings": [],
        "encryptionKey": null
    }
    
  2. Specify field mappings if there are differences in field name or type, or if you need multiple versions of a source field in the search index.

  3. See Create an indexer for more information about other properties.

An indexer runs automatically when it's created. You can prevent this by setting "disabled" to true. To control indexer execution, run an indexer on demand or put it on a schedule.

Check indexer status

To monitor the indexer status and execution history, check the indexer execution history in the Azure portal, or send a Get Indexer Status REST APIrequest

  1. On the search service page, open Search management > Indexers.

  2. Select an indexer to access configuration and execution history.

  3. Select a specific indexer job to view details, warnings, and errors.

Execution history contains up to 50 of the most recently completed executions, which are sorted in the reverse chronological order so that the latest execution comes first.

Indexing new and changed documents

Once an indexer has fully populated a search index, you might want subsequent indexer runs to incrementally index just the new and changed documents in your database.

To enable incremental indexing, set the "dataChangeDetectionPolicy" property in your data source definition. This property tells the indexer which change tracking mechanism is used on your data.

For Azure Cosmos DB indexers, the only supported policy is the HighWaterMarkChangeDetectionPolicy using the _ts (timestamp) property provided by Azure Cosmos DB.

The following example shows a data source definition with a change detection policy:

"dataChangeDetectionPolicy": {
    "@odata.type": "#Microsoft.Azure.Search.HighWaterMarkChangeDetectionPolicy",
"  highWaterMarkColumnName": "_ts"
},

Note

When you assign a null value to a field in your Azure Cosmos DB, the AI Search indexer is unable to distinguish between null and a missing field value. Therefore, if a field in the index is empty, it will not be substituted with a null value, even if that modification was specifically made in your database.

Incremental indexing and custom queries

If you're using a custom query to retrieve documents, make sure the query orders the results by the _ts column. This enables periodic check-pointing that Azure AI Search uses to provide incremental progress in the presence of failures.

In some cases, even if your query contains an ORDER BY [collection alias]._ts clause, Azure AI Search might not infer that the query is ordered by the _ts. You can tell Azure AI Search that results are ordered by setting the assumeOrderByHighWaterMarkColumn configuration property.

To specify this hint, create or update your indexer definition as follows:

{
    ... other indexer definition properties
    "parameters" : {
        "configuration" : { "assumeOrderByHighWaterMarkColumn" : true } }
} 

Indexing deleted documents

When rows are deleted from the collection, you normally want to delete those rows from the search index as well. The purpose of a data deletion detection policy is to efficiently identify deleted data items. Currently, the only supported policy is the Soft Delete policy (deletion is marked with a flag of some sort), which is specified in the data source definition as follows:

"dataDeletionDetectionPolicy"": {
    "@odata.type" : "#Microsoft.Azure.Search.SoftDeleteColumnDeletionDetectionPolicy",
    "softDeleteColumnName" : "the property that specifies whether a document was deleted",
    "softDeleteMarkerValue" : "the value that identifies a document as deleted"
}

If you're using a custom query, make sure that the property referenced by softDeleteColumnName is projected by the query.

The softDeleteColumnName must be a top-level field in the index. Using nested fields within complex data types as the softDeleteColumnName isn't supported.

The following example creates a data source with a soft-deletion policy:

POST https://[service name].search.windows.net/datasources?api-version=2024-07-01
Content-Type: application/json
api-key: [Search service admin key]

{
    "name": "[my-cosmosdb-ds]",
    "type": "cosmosdb",
    "credentials": {
        "connectionString": "AccountEndpoint=https://[cosmos-account-name].documents.azure.com;AccountKey=[cosmos-account-key];Database=[cosmos-database-name]"
    },
    "container": { "name": "[my-cosmos-collection]" },
    "dataChangeDetectionPolicy": {
        "@odata.type": "#Microsoft.Azure.Search.HighWaterMarkChangeDetectionPolicy",
        "highWaterMarkColumnName": "_ts"
    },
    "dataDeletionDetectionPolicy": {
        "@odata.type": "#Microsoft.Azure.Search.SoftDeleteColumnDeletionDetectionPolicy",
        "softDeleteColumnName": "isDeleted",
        "softDeleteMarkerValue": "true"
    }
}

Use .NET

For data accessed through the SQL API protocol, you can use the .NET SDK to automate with indexers. We recommend that you review the previous REST API sections to learn concepts, workflow, and requirements. You can then refer to following .NET API reference documentation to implement a JSON indexer in managed code:

Next steps

You can now control how you run the indexer, monitor status, or schedule indexer execution. The following articles apply to indexers that pull content from Azure Cosmos DB: