Hybrid search in Azure Cosmos DB for NoSQL (preview)

Azure Cosmos DB for NoSQL now supports a powerful hybrid search capability that combines Vector Search with Full Text Search scoring (BM25) using the Reciprocal Rank Fusion (RRF) function.

Note

Full Text & Hybrid Search is in early preview and may not be available in all regions at this time.

Hybrid search draws on the strengths of both vector-based and traditional keyword-based search to return more relevant and accurate results. It is straightforward to set up in Azure Cosmos DB for NoSQL because metadata and vectors can be stored within the same document.

Hybrid search in Azure Cosmos DB for NoSQL integrates two distinct search methodologies:

  • Vector search: Utilizes machine learning models to understand the semantic meaning of queries and documents. This allows for more nuanced and context-aware search results, especially useful for complex queries where traditional keyword search might fall short.
  • Full text search (BM25): A well-established algorithm that scores documents based on the presence and frequency of words and terms. BM25 is particularly effective for straightforward keyword searches, providing a robust baseline for search relevance.

The results from vector search and full text search are then combined using the Reciprocal Rank Fusion (RRF) function. RRF is a rank aggregation method that merges the rankings from multiple search algorithms to produce a single, unified ranking. As a result, the final ranking draws on the strengths of both search approaches, which offers several benefits:

  • Enhanced Relevance: By combining semantic understanding with keyword matching, hybrid search delivers more relevant results for a wide range of queries.
  • Improved Accuracy: The RRF function ensures that the most pertinent results from both search methods are prioritized.
  • Versatility: Suitable for various use cases including Retrieval Augmented Generation (RAG) to improve the responses generated by an LLM grounded on your own data.
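To make the rank fusion step concrete, here is a minimal Python sketch of the general RRF technique. It is not the service's implementation: the smoothing constant k=60 is a commonly used default and an assumption here, and the document ids and the two input rankings are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed, commonly used constant,
    not a documented Azure Cosmos DB setting.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical inputs: one ranking per search method, best match first.
vector_ranking = ["doc3", "doc1", "doc2"]
full_text_ranking = ["doc1", "doc4", "doc3"]

fused = reciprocal_rank_fusion([vector_ranking, full_text_ranking])
print(fused)
```

In this example doc1 ends up first because it ranks near the top of both lists, while doc2 and doc4 each appear in only one list and fall to the bottom.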

How to use hybrid search

  1. Enable the Vector Search in Azure Cosmos DB for NoSQL feature.
  2. Enable the Full Text & Hybrid Search for NoSQL preview feature.
  3. Create a container with a vector policy, full text policy, vector index, and full text index.
  4. Insert your data with text and vector properties.
  5. Run hybrid queries against the data.

Important

Currently, vector policies and vector indexes are immutable after creation. To make changes, create a new container.

A sample vector policy

{
   "vectorEmbeddings": [
       {
           "path":"/vector",
           "dataType":"float32",
           "distanceFunction":"cosine",
           "dimensions":3
       }
   ]
}

A sample full text policy

{
    "defaultLanguage": "en-US",
    "fullTextPaths": [
        {
            "path": "/text",
            "language": "en-US"
        }
    ]
}

A sample indexing policy with both full text and vector indexes

{
    "indexingMode": "consistent",
    "automatic": true,
    "includedPaths": [
        {
            "path": "/*"
        }
    ],
    "excludedPaths": [
        {
            "path": "/\"_etag\"/?"
        },
        {
            "path": "/vector/*"
        }
    ],
    "fullTextIndexes": [
        {
            "path": "/text"
        }
    ],
    "vectorIndexes": [
        {
            "path": "/vector",
            "type": "DiskANN"
        }
    ]
}
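Because the three policies refer to the same paths, it can help to check them against each other before creating the container. The sketch below uses a hypothetical helper, check_hybrid_policies, which is not part of any SDK. It assumes that every indexed path must also be declared in the matching policy, and it treats excluding the vector path from regular indexing as a convention taken from the sample above.

```python
def check_hybrid_policies(vector_policy, full_text_policy, indexing_policy):
    """Return a list of problems; an empty list means the policies line up.

    Hypothetical helper for illustration only, not an SDK function.
    """
    problems = []
    embedding_paths = {e["path"] for e in vector_policy.get("vectorEmbeddings", [])}
    text_paths = {p["path"] for p in full_text_policy.get("fullTextPaths", [])}
    excluded = {p["path"] for p in indexing_policy.get("excludedPaths", [])}

    for index in indexing_policy.get("vectorIndexes", []):
        path = index["path"]
        if path not in embedding_paths:
            problems.append(f"vector index {path} is missing from the vector policy")
        # Mirrors the sample indexing policy, which excludes "/vector/*".
        if path + "/*" not in excluded:
            problems.append(f"vector path {path}/* is not in excludedPaths")

    for index in indexing_policy.get("fullTextIndexes", []):
        if index["path"] not in text_paths:
            problems.append(f"full text index {index['path']} is missing from the full text policy")

    return problems


# The three sample policies from this article, written as Python dicts.
vector_policy = {
    "vectorEmbeddings": [
        {"path": "/vector", "dataType": "float32",
         "distanceFunction": "cosine", "dimensions": 3}
    ]
}
full_text_policy = {
    "defaultLanguage": "en-US",
    "fullTextPaths": [{"path": "/text", "language": "en-US"}],
}
indexing_policy = {
    "indexingMode": "consistent",
    "automatic": True,
    "includedPaths": [{"path": "/*"}],
    "excludedPaths": [{"path": "/\"_etag\"/?"}, {"path": "/vector/*"}],
    "fullTextIndexes": [{"path": "/text"}],
    "vectorIndexes": [{"path": "/vector", "type": "DiskANN"}],
}

problems = check_hybrid_policies(vector_policy, full_text_policy, indexing_policy)
print(problems or "policies are consistent")
```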

Hybrid search queries

To run a hybrid search query, use the RRF system function in an ORDER BY RANK clause that includes both the VectorDistance and FullTextScore functions. For example, a parameterized query that returns the top k most relevant results looks like this:

SELECT TOP @k *
FROM c
ORDER BY RANK RRF(VectorDistance(c.vector, @queryVector), FullTextScore(c.content, [@searchTerm1, @searchTerm2, ...]))

Suppose each document stores its vector embedding in the property c.vector and its text data in the property c.text. To get the 10 most relevant documents using hybrid search, the query can be written as:

SELECT TOP 10 *
FROM c
ORDER BY RANK RRF(VectorDistance(c.vector, [1,2,3]), FullTextScore(c.text, ["text", "to", "search", "goes", "here"]))
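The parameterized form of the query can also be assembled in application code. The following Python sketch builds the query text and a matching parameter list for a variable number of search terms. The helper name build_hybrid_query is hypothetical, the property names c.vector and c.text follow the example above, and the name/value parameter shape is assumed to match what your SDK expects for parameterized queries.

```python
def build_hybrid_query(k, query_vector, search_terms):
    """Build hybrid query text plus name/value parameters.

    Hypothetical helper; assumes documents use the properties
    c.vector and c.text as in the example above.
    """
    if not search_terms:
        raise ValueError("at least one search term is required")

    # One placeholder per term: @searchTerm1, @searchTerm2, and so on.
    term_names = [f"@searchTerm{i}" for i in range(1, len(search_terms) + 1)]
    query = (
        "SELECT TOP @k * FROM c "
        "ORDER BY RANK RRF("
        "VectorDistance(c.vector, @queryVector), "
        f"FullTextScore(c.text, [{', '.join(term_names)}]))"
    )
    parameters = [
        {"name": "@k", "value": k},
        {"name": "@queryVector", "value": query_vector},
    ]
    parameters += [
        {"name": name, "value": term}
        for name, term in zip(term_names, search_terms)
    ]
    return query, parameters


query, parameters = build_hybrid_query(10, [1, 2, 3], ["text", "to", "search"])
print(query)
for parameter in parameters:
    print(parameter)
```

Passing the vector and search terms as parameters rather than concatenating them into the query text avoids quoting mistakes like an unterminated string literal.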