Index and query vectors in Azure Cosmos DB for NoSQL in Java
This article shows you how to create vector data, index the data, and then query the data in a container.
Before you use vector indexing and search, you must first enable vector search in Azure Cosmos DB for NoSQL. After you set up the Azure Cosmos DB container for vector search, you create a vector embedding policy. Next, you add vector indexes to the container indexing policy. Then you create a container with vector indexes and a vector embedding policy. Finally, you perform a vector search on the stored data.
Prerequisites
- An existing Azure Cosmos DB for NoSQL account.
- If you don't have an Azure subscription, try Azure Cosmos DB for NoSQL for free.
- If you have an existing Azure subscription, create a new Azure Cosmos DB for NoSQL account.
- The latest version of the Azure Cosmos DB Java SDK.
Enable the feature
To enable vector search for Azure Cosmos DB for NoSQL, follow these steps:
- Go to your Azure Cosmos DB for NoSQL resource page.
- On the left pane, under Settings, select Features.
- Select Vector Search in Azure Cosmos DB for NoSQL.
- Read the description of the feature to confirm that you want to enable it.
- Select Enable to turn on vector search in Azure Cosmos DB for NoSQL.
Tip
Alternatively, use the Azure CLI to update the capabilities of your account to support Azure Cosmos DB for NoSQL vector search.
az cosmosdb update \
--resource-group <resource-group-name> \
--name <account-name> \
--capabilities EnableNoSQLVectorSearch
The registration request is autoapproved, but it might take 15 minutes to take effect.
Understand the steps involved in vector search
The following steps assume that you know how to set up an Azure Cosmos DB for NoSQL account and create a database. The vector search feature is currently not supported on existing containers, so you need to create a new container. When you create the container, you specify the container-level vector embedding policy and the vector indexing policy.
Let's take an example of how to create a database for an internet-based bookstore. You want to store title, author, ISBN, and description information for each book. You also need to define the following two properties to contain vector embeddings:
- The contentVector property contains text embeddings that are generated from the text content of the book. For example, you concatenate the title, author, isbn, and description properties before you create the embedding.
- The coverImageVector property is generated from images of the book's cover.
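As a minimal sketch of the concatenation step described for contentVector, the following plain Java helper joins the four book fields into one string that could be sent to an embedding model. The class name EmbeddingInput, the method name build, and the "Label: value" layout are assumptions made for illustration; the article doesn't prescribe a format, and the call to an embedding model is intentionally left out.

```java
import java.util.StringJoiner;

// Hypothetical helper (not part of the Azure Cosmos DB SDK): prepares the text
// that is later turned into the contentVector embedding.
final class EmbeddingInput {

    // Joins the book fields into a single newline-separated string.
    // The "Label: value" layout is an illustrative assumption.
    static String build(String title, String author, String isbn, String description) {
        StringJoiner joiner = new StringJoiner("\n");
        joiner.add("Title: " + title);
        joiner.add("Author: " + author);
        joiner.add("ISBN: " + isbn);
        joiner.add("Description: " + description);
        return joiner.toString();
    }

    public static void main(String[] args) {
        // Uses the placeholder values from the sample item in this article.
        System.out.println(build("book-title", "book-author", "book-isbn", "book-description"));
    }
}
```

Whatever layout you choose, it likely makes sense to use the same one for every item so that the resulting embeddings stay comparable.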
To perform a vector search, you:
- Create and store vector embeddings for the fields on which you want to perform vector search.
- Specify the vector embedding paths in the vector embedding policy.
- Include any vector indexes that you want in the indexing policy for the container.
For subsequent sections of this article, consider the following structure for the items stored in your container:
{
"title": "book-title",
"author": "book-author",
"isbn": "book-isbn",
"description": "book-description",
"contentVector": [2, -1, 4, 3, 5, -2, 5, -7, 3, 1],
"coverImageVector": [0.33, -0.52, 0.45, -0.67, 0.89, -0.34, 0.86, -0.78]
}
First, create the CosmosContainerProperties object.
CosmosContainerProperties collectionDefinition = new CosmosContainerProperties(UUID.randomUUID().toString(), "Partition_Key_Def");
Create a vector embedding policy for your container
Now you need to define a container vector policy. This policy provides information that informs the Azure Cosmos DB query engine about how to handle vector properties in the VectorDistance system functions. This policy also provides necessary information to the vector indexing policy, if you choose to specify one.
The following information is included in the container vector policy:
Parameter | Description |
---|---|
path | The property path that contains vectors. |
datatype | The type of the elements of the vector. (The default is Float32.) |
dimensions | The length of each vector in the path. (The default is 1536.) |
distanceFunction | The metric used to compute distance/similarity. (The default is Cosine.) |
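To make the distanceFunction parameter more concrete, here is a small plain Java sketch of the two metrics used in this article's policy example, cosine similarity and dot product. It only illustrates the math on in-memory arrays; the class and method names are hypothetical, and it isn't how the service computes VectorDistance internally.

```java
// Hypothetical illustration of two distance/similarity metrics (no SDK required).
final class VectorMetrics {

    // Sum of element-wise products of two equal-length vectors.
    static double dotProduct(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += (double) a[i] * b[i];
        }
        return sum;
    }

    // Dot product divided by the product of the two vector magnitudes.
    static double cosineSimilarity(float[] a, float[] b) {
        double dot = dotProduct(a, b);
        double magnitudes = Math.sqrt(dotProduct(a, a)) * Math.sqrt(dotProduct(b, b));
        if (magnitudes == 0.0) {
            throw new IllegalArgumentException("Cosine similarity is undefined for a zero vector");
        }
        return dot / magnitudes;
    }

    public static void main(String[] args) {
        float[] x = {1f, 2f, 3f};
        float[] y = {4f, 5f, 6f};
        System.out.println("dot product: " + dotProduct(x, y));
        System.out.println("cosine similarity: " + cosineSimilarity(x, y));
    }
}
```

Cosine similarity ignores vector magnitude, while dot product doesn't, so the two can rank the same items differently unless the vectors are normalized.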
For the example with book details, the vector policy might look like the following example:
// Creating vector embedding policy
CosmosVectorEmbeddingPolicy cosmosVectorEmbeddingPolicy = new CosmosVectorEmbeddingPolicy();
CosmosVectorEmbedding embedding1 = new CosmosVectorEmbedding();
embedding1.setPath("/coverImageVector");
embedding1.setDataType(CosmosVectorDataType.FLOAT32);
embedding1.setDimensions(8L);
embedding1.setDistanceFunction(CosmosVectorDistanceFunction.COSINE);
CosmosVectorEmbedding embedding2 = new CosmosVectorEmbedding();
embedding2.setPath("/contentVector");
embedding2.setDataType(CosmosVectorDataType.FLOAT32);
embedding2.setDimensions(10L);
embedding2.setDistanceFunction(CosmosVectorDistanceFunction.DOT_PRODUCT);
cosmosVectorEmbeddingPolicy.setCosmosVectorEmbeddings(Arrays.asList(embedding1, embedding2));
collectionDefinition.setVectorEmbeddingPolicy(cosmosVectorEmbeddingPolicy);
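The policy above declares 8 dimensions for /coverImageVector and 10 for /contentVector, which matches the lengths of the arrays in the sample item. A client-side check along the following lines might catch a mismatch before an item is written. This is a sketch in plain Java: the class VectorDimensionCheck and its method are hypothetical helpers, not SDK calls, and the article doesn't state how the service itself treats vectors of the wrong length.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical pre-insert check (not part of the Azure Cosmos DB SDK).
final class VectorDimensionCheck {

    // Returns true only when the path is declared and the vector has exactly
    // the number of elements declared for that path.
    static boolean hasDeclaredLength(Map<String, Integer> dimensionsByPath, String path, float[] vector) {
        Integer expected = dimensionsByPath.get(path);
        return expected != null && vector != null && vector.length == expected;
    }

    public static void main(String[] args) {
        // Dimensions copied from the example vector embedding policy.
        Map<String, Integer> dimensionsByPath = new LinkedHashMap<>();
        dimensionsByPath.put("/coverImageVector", 8);
        dimensionsByPath.put("/contentVector", 10);

        // Vector copied from the sample item in this article.
        float[] contentVector = {2, -1, 4, 3, 5, -2, 5, -7, 3, 1};
        System.out.println(hasDeclaredLength(dimensionsByPath, "/contentVector", contentVector));
    }
}
```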
Create a vector index in the indexing policy
After you decide on the vector embedding paths, you must add vector indexes to the indexing policy. Currently, the vector search feature for Azure Cosmos DB for NoSQL is supported only on new containers. When you create the container, you apply the vector policy. You can't modify the policy later. The indexing policy looks something like the following example:
IndexingPolicy indexingPolicy = new IndexingPolicy();
indexingPolicy.setIndexingMode(IndexingMode.CONSISTENT);
ExcludedPath excludedPath1 = new ExcludedPath("/coverImageVector/*");
ExcludedPath excludedPath2 = new ExcludedPath("/contentVector/*");
indexingPolicy.setExcludedPaths(Arrays.asList(excludedPath1, excludedPath2));
IncludedPath includedPath1 = new IncludedPath("/*");
indexingPolicy.setIncludedPaths(Collections.singletonList(includedPath1));
// Creating vector indexes
CosmosVectorIndexSpec cosmosVectorIndexSpec1 = new CosmosVectorIndexSpec();
cosmosVectorIndexSpec1.setPath("/coverImageVector");
cosmosVectorIndexSpec1.setType(CosmosVectorIndexType.QUANTIZED_FLAT.toString());
CosmosVectorIndexSpec cosmosVectorIndexSpec2 = new CosmosVectorIndexSpec();
cosmosVectorIndexSpec2.setPath("/contentVector");
cosmosVectorIndexSpec2.setType(CosmosVectorIndexType.DISK_ANN.toString());
indexingPolicy.setVectorIndexes(Arrays.asList(cosmosVectorIndexSpec1, cosmosVectorIndexSpec2));
collectionDefinition.setIndexingPolicy(indexingPolicy);
Finally, create the container. The container definition carries both the vector embedding policy and the indexing policy, which includes the vector indexes.
database.createContainer(collectionDefinition).block();
Important
The vector path is added to the excludedPaths section of the indexing policy to ensure optimized performance for insertion. Not adding the vector path to excludedPaths results in a higher request unit charge and latency for vector insertions.
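Because it's easy to add a vector index and forget the matching exclusion, a small check such as the following could be run over your policy definitions before you create the container. It's a plain Java sketch with hypothetical names. It only looks for an exact "<vector path>/*" entry, as in the example above, and doesn't understand broader wildcard exclusions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical lint-style helper (not part of the Azure Cosmos DB SDK).
final class ExcludedPathCheck {

    // Returns the vector paths that have no matching "<path>/*" excluded path.
    static List<String> missingExclusions(List<String> vectorPaths, List<String> excludedPaths) {
        List<String> missing = new ArrayList<>();
        for (String vectorPath : vectorPaths) {
            if (!excludedPaths.contains(vectorPath + "/*")) {
                missing.add(vectorPath);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> vectorPaths = Arrays.asList("/coverImageVector", "/contentVector");
        List<String> excludedPaths = Arrays.asList("/coverImageVector/*", "/contentVector/*");
        // An empty list means every vector path in this example is excluded.
        System.out.println(missingExclusions(vectorPaths, excludedPaths));
    }
}
```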
Run a vector similarity search query
After you create a container with the vector policy that you want and insert vector data into the container, use the VectorDistance system function in a query to conduct a vector search.
Suppose that you want to search for books about food recipes by looking at the description. You first need to get the embeddings for your query text. In this case, you might want to generate embeddings for the query text "food recipe". After you have the embedding for your search query, you can use it in the VectorDistance function in the vector search query to get all the items that are similar to your query:
SELECT TOP 10 c.title, VectorDistance(c.contentVector, [1,2,3,4,5,6,7,8,9,10]) AS SimilarityScore
FROM c
ORDER BY VectorDistance(c.contentVector, [1,2,3,4,5,6,7,8,9,10])
This query retrieves the book titles along with similarity scores with respect to your query. Here's an example in Java:
float[] embedding = new float[10];
for (int i = 0; i < 10; i++) {
    embedding[i] = i + 1;
}
ArrayList<SqlParameter> paramList = new ArrayList<SqlParameter>();
paramList.add(new SqlParameter("@embedding", embedding));
SqlQuerySpec querySpec = new SqlQuerySpec("SELECT TOP 10 c.title, VectorDistance(c.contentVector,@embedding) AS SimilarityScore FROM c ORDER BY VectorDistance(c.contentVector,@embedding)", paramList);
// The query projects only title and SimilarityScore, so read each result as a generic
// Jackson JsonNode (com.fasterxml.jackson.databind.JsonNode) instead of a typed class.
CosmosPagedIterable<JsonNode> results = container.queryItems(querySpec, new CosmosQueryRequestOptions(), JsonNode.class);
for (JsonNode result : results) {
    logger.info(String.format("Title: %s, similarity score: %s", result.get("title").asText(), result.get("SimilarityScore").asDouble()));
}
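To build intuition for what the query returns, the following plain Java sketch ranks a handful of in-memory vectors against a query vector by brute force and keeps the top results. It uses dot product because that's the distance function configured for /contentVector in the example policy. All names here are hypothetical, the data is made up for illustration, and the sketch assumes that a higher dot product means a closer match. It doesn't reflect how the service or its vector indexes execute the query.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical brute-force ranking, for intuition only (no SDK required).
final class BruteForceRanking {

    // Sum of element-wise products of two equal-length vectors.
    static double dot(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += (double) a[i] * b[i];
        }
        return sum;
    }

    // Returns up to k titles, ordered from highest to lowest dot-product score.
    static List<String> topK(Map<String, float[]> vectorsByTitle, float[] query, int k) {
        List<Map.Entry<String, float[]>> entries = new ArrayList<>(vectorsByTitle.entrySet());
        entries.sort(Comparator.comparingDouble(
                (Map.Entry<String, float[]> e) -> dot(e.getValue(), query)).reversed());
        List<String> titles = new ArrayList<>();
        for (int i = 0; i < Math.min(k, entries.size()); i++) {
            titles.add(entries.get(i).getKey());
        }
        return titles;
    }

    public static void main(String[] args) {
        // Made-up two-dimensional vectors keyed by made-up titles.
        Map<String, float[]> vectorsByTitle = new LinkedHashMap<>();
        vectorsByTitle.put("book-a", new float[]{1f, 0f});
        vectorsByTitle.put("book-b", new float[]{0f, 1f});
        vectorsByTitle.put("book-c", new float[]{1f, 1f});
        System.out.println(topK(vectorsByTitle, new float[]{1f, 0.5f}, 2));
    }
}
```

A scan like this touches every item, which is the cost that a vector index is meant to reduce on larger containers.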