Information retrieval

Once you have generated the embeddings for your chunks, the next step is to generate the index in the vector database and experiment to determine the optimal searches to perform. When you're experimenting with information retrieval, there are several areas to consider, including configuration options for the search index, the types of searches you should perform, and your reranking strategy. This article covers these three topics.

This article is part of a series. Read the introduction.

Search index

Note

Azure AI Search is a first-party Azure search service. This section mentions some specifics for Azure AI Search. If you're using a different store, consult that service's documentation to find its key configuration settings.

The search index in your store has a column for every field in your data. Search stores generally support nonvector data types, such as string, Boolean, integer, single, double, and datetime, nonvector collections such as Collection(string), and vector data types such as Collection(single). For each column, you must configure information such as the data type and whether the field is filterable, retrievable, and/or searchable.

The following are some key decisions you must make for the vector search configuration that are applied to vector fields:

  • Vector search algorithm - The algorithm used to search for relevant matches. Azure AI Search offers exhaustive KNN, a brute-force option that scans the entire vector space, and Hierarchical Navigable Small World (HNSW), a more performant option that performs an approximate nearest neighbor (ANN) search.
  • metric - This configuration is the similarity metric used to calculate nearness by the algorithm. The options in Azure AI Search are cosine, dotProduct, and Euclidean. If you're using Azure OpenAI embedding models, choose cosine.
  • efConstruction - Parameter used during HNSW index construction that sets the number of nearest neighbors connected to a vector during indexing. A larger efConstruction value results in a better-quality index than a smaller value. The tradeoff is that a larger value requires more time, storage, and compute. efConstruction should be higher for a large number of chunks and lower for a small number of chunks. Determining the optimal value requires experimentation with your data and expected queries.
  • efSearch - Parameter that is used at query time to set the number of nearest neighbors (that is, similar chunks) used during search.
  • m - The bidirectional link count. The range is 4 to 10, with lower numbers returning less noise in the results.

In Azure AI Search, the vector configurations are encapsulated in a vectorSearch configuration. When configuring your vector columns, you reference the appropriate configuration for that vector column and set the number of dimensions. The vector column's dimensions attribute represents the number of dimensions generated by the embedding model you chose. For example, the storage-optimized text-embedding-3-small model generates 1,536 dimensions.
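
The following is a minimal sketch of how these settings might come together when you define an index with the azure-search-documents Python SDK. The index, configuration, profile, and field names are illustrative, and the exact class and parameter names vary between SDK versions.

from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    HnswAlgorithmConfiguration,
    HnswParameters,
    SearchField,
    SearchFieldDataType,
    SearchIndex,
    SearchableField,
    SimpleField,
    VectorSearch,
    VectorSearchProfile,
)

# Placeholder endpoint, key, and names.
index_client = SearchIndexClient(
    "https://<search-service>.search.windows.net", AzureKeyCredential("<api-key>")
)

vector_search = VectorSearch(
    algorithms=[
        HnswAlgorithmConfiguration(
            name="hnsw-config",
            # The m, efConstruction, efSearch, and metric settings discussed above.
            parameters=HnswParameters(m=4, ef_construction=400, ef_search=500, metric="cosine"),
        )
    ],
    profiles=[VectorSearchProfile(name="vector-profile", algorithm_configuration_name="hnsw-config")],
)

index = SearchIndex(
    name="chunk-index",
    fields=[
        SimpleField(name="id", type=SearchFieldDataType.String, key=True),
        SearchableField(name="title", type=SearchFieldDataType.String),
        SearchableField(name="content", type=SearchFieldDataType.String),
        SearchField(
            name="contentVector",
            type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
            searchable=True,
            # Must match your embedding model, for example 1,536 dimensions for text-embedding-3-small.
            vector_search_dimensions=1536,
            vector_search_profile_name="vector-profile",
        ),
    ],
    vector_search=vector_search,
)

index_client.create_or_update_index(index)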

Searches

When executing queries from your prompt orchestrator against your search store, you have many options to consider. You need to determine:

  • What type of search you're going to perform: vector, keyword (full text), or hybrid
  • Whether you're going to query against one or more columns
  • Whether you're going to manually run multiple queries, such as a keyword query and a vector search
  • Whether the query needs to be broken down into subqueries
  • Whether filtering should be used in your queries

Your prompt orchestrator might take a static approach or a dynamic approach that mixes these options based on context clues from the prompt. The following sections address these options to help you experiment to find the right approach for your workload.

Search types

Search platforms generally support full text and vector searches. Some platforms, like Azure AI Search, support hybrid searches. To see the capabilities of various vector search offerings, review Choose an Azure service for vector search.

Vector searches match on similarity between the vectorized query (prompt) and vector fields.

Important

You should perform the same cleaning operations on the query that you performed on the chunks before embedding them. For example, if you lowercased every word in your embedded chunks, you should lowercase every word in the query before embedding it.
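
For example, the following is a minimal sketch of a shared preprocessing helper, assuming lowercasing and whitespace normalization were the cleaning steps applied to your chunks. The preprocess function is illustrative, and embedding_model mirrors the samples later in this section.

import re

def preprocess(text: str) -> str:
    # Apply the same cleaning used on the chunks before they were embedded:
    # lowercase and collapse whitespace (illustrative steps only).
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()

# Use the identical function for both chunks and queries.
chunk_embedding = embedding_model.generate_embedding(chunk=preprocess(chunk_text))
query_embedding = embedding_model.generate_embedding(chunk=preprocess(query))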

Note

You can perform a vector search against multiple vector fields in the same query. In Azure AI Search, that's technically a hybrid search. For more information, see the hybrid search section later in this article.

embedding = embedding_model.generate_embedding(
    chunk=str(pre_process.preprocess(query))
)

vector = RawVectorQuery(
    k=retrieve_num_of_documents,
    fields="contentVector",
    vector=embedding,
)

results = client.search(
    search_text=None,
    vector_queries=[vector],
    top=retrieve_num_of_documents,
    select=["title", "content", "summary"],
)

The sample code performs a vector search against the contentVector field. Notice that the code preprocesses the query before embedding it. That preprocessing should use the same code that preprocessed the chunks before they were embedded, and the embedding model must be the same model that embedded the chunks.

Full text searches match plain text stored in an index. It's a common practice to extract keywords from a query and use those extracted keywords in a full text search against one or more indexed columns. Full text searches can be configured to return matches where any terms or all terms match.

You have to experiment to determine which fields are effective to run full text searches against. As discussed in the enrichment phase, keyword and entity metadata fields are good candidates for full text search in scenarios where content has similar semantic meaning but entities or keywords differ. Other common fields to consider for full text search are title, summary, and the chunk text.

formatted_search_results = []

results = client.search(
    search_text=query,
    top=retrieve_num_of_documents,
    select=["title", "content", "summary"],
)

formatted_search_results = format_results(results)

The sample code performs a full text search against the title, content, and summary fields.

Azure AI Search supports hybrid queries, where your query can contain one or more text searches and one or more vector searches. The platform performs each query, gets the intermediate results, reranks the results by using Reciprocal Rank Fusion (RRF), and returns the top N results.

embedding = embedding_model.generate_embedding(
    chunk=str(pre_process.preprocess(query))
)
vector1 = RawVectorQuery(
    k=retrieve_num_of_documents,
    fields="contentVector",
    vector=embedding,
)
vector2 = RawVectorQuery(
    k=retrieve_num_of_documents,
    fields="questionVector",
    vector=embedding,
)

results = client.search(
    search_text=query,
    vector_queries=[vector1, vector2],
    top=retrieve_num_of_documents,
    select=["title", "content", "summary"],
)

The sample code performs a full text search against the title, content, and summary fields and vector searches against the contentVector and questionVector fields. The Azure AI Search platform runs all the queries in parallel, reranks the results, and returns the top retrieve_num_of_documents documents.

Manual multiple queries

You can, of course, manually run multiple queries, such as a vector search and a keyword full text search. You aggregate the results, rerank them manually, and return the top results. (A sketch of aggregating manually run queries with Reciprocal Rank Fusion follows this list.) The following are use cases for manually running multiple queries:

  • You're using a search platform that doesn't support hybrid searches. You would follow this option to perform your own hybrid search.
  • You want to run full text searches against different queries. For example, you might extract keywords from the query and run a full text search against your keywords metadata field. You might then extract entities and run a query against the entities metadata field.
  • You want to control the reranking process yourself.
  • The query requires multiple subqueries to be run to retrieve grounding data from multiple sources.
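
The following is a minimal sketch of aggregating two manually run queries with Reciprocal Rank Fusion. It reuses the client, query, vector, and retrieve_num_of_documents variables from the earlier samples, assumes each result has a unique title field to deduplicate on, and uses the commonly cited RRF constant of 60; your key field and constant might differ.

from collections import defaultdict

def reciprocal_rank_fusion(result_lists, key="title", k=60):
    # Score each document by summing 1 / (k + rank) across every result list it appears in.
    scores = defaultdict(float)
    docs = {}
    for results in result_lists:
        for rank, doc in enumerate(results, start=1):
            scores[doc[key]] += 1.0 / (k + rank)
            docs[doc[key]] = doc
    ranked_keys = sorted(scores, key=scores.get, reverse=True)
    return [docs[doc_key] for doc_key in ranked_keys]

# Manually run a full text query and a vector query, then fuse the results.
text_results = list(client.search(
    search_text=query,
    top=retrieve_num_of_documents,
    select=["title", "content", "summary"],
))
vector_results = list(client.search(
    search_text=None,
    vector_queries=[vector],
    top=retrieve_num_of_documents,
    select=["title", "content", "summary"],
))

top_results = reciprocal_rank_fusion([text_results, vector_results])[:retrieve_num_of_documents]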

Multiple subqueries

Some prompts are complex and require more than one collection of data to ground the model. For example, the query "How do electric cars work and how do they compare to ICE vehicles?" likely requires grounding data from multiple sources.

It's good practice to determine whether a query requires multiple searches before running any searches. If you determine that multiple subqueries are required, you can manually run multiple queries for all of them. Use a large language model to determine whether multiple subqueries are required. The following prompt, taken from the RAG Experiment Accelerator GitHub repository, categorizes a query as simple or complex, with complex queries requiring multiple subqueries:

Consider the given question to analyze and determine if it falls into one of these categories:
1. Simple, factual question
  a. The question is asking for a straightforward fact or piece of information
  b. The answer could likely be found stated directly in a single passage of a relevant document
  c. Breaking the question down further is unlikely to be beneficial
  Examples: "What year did World War 2 end?", "What is the capital of France?, "What is the features of productX?"
2. Complex, multi-part question
  a. The question has multiple distinct components or is asking for information about several related topics
  b. Different parts of the question would likely need to be answered by separate passages or documents
  c. Breaking the question down into sub-questions for each component would allow for better results
  d. The question is open-ended and likely to have a complex or nuanced answer
  e. Answering it may require synthesizing information from multiple sources
  f. The question may not have a single definitive answer and could warrant analysis from multiple angles
  Examples: "What were the key causes, major battles, and outcomes of the American Revolutionary War?", "How do electric cars work and how do they compare to gas-powered vehicles?"

Based on this rubric, does the given question fall under category 1 (simple) or category 2 (complex)? The output should be in strict JSON format. Ensure that the generated JSON is 100 percent structurally correct, with proper nesting, comma placement, and quotation marks. There should not be any comma after last element in the JSON.

Example output:
{
  "category": "simple"
}
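
The following is a minimal sketch of calling this rubric from Python by using the openai package against an Azure OpenAI chat deployment. The endpoint, key, deployment name, and the categorize_prompt variable that holds the rubric above are placeholders.

import json
from openai import AzureOpenAI

llm_client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",  # Placeholder endpoint
    api_key="<api-key>",
    api_version="2024-02-01",
)

response = llm_client.chat.completions.create(
    model="<chat-deployment-name>",  # Placeholder deployment name
    messages=[
        {"role": "system", "content": categorize_prompt},  # The rubric shown above
        {"role": "user", "content": query},
    ],
    temperature=0,
)

category = json.loads(response.choices[0].message.content)["category"]
run_subqueries = category == "complex"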

A large language model can also be used to extract subqueries from a complex query. The following prompt, taken from the RAG Experiment Accelerator GitHub repository, converts a complex query into multiple subqueries.

Your task is to take a question as input and generate maximum 3 sub-questions that cover all aspects of the original question. The output should be in strict JSON format, with the sub-questions contained in an array.
Here are the requirements:
1. Analyze the original question and identify the key aspects or components.
2. Generate sub-questions that address each aspect of the original question.
3. Ensure that the sub-questions collectively cover the entire scope of the original question.
4. Format the output as a JSON object with a single key "questions" that contains an array of the generated sub-questions.
5. Each sub-question should be a string within the "questions" array.
6. The JSON output should be valid and strictly formatted.
7. Ensure that the generated JSON is 100 percent structurally correct, with proper nesting, comma placement, and quotation marks. The JSON should be formatted with proper indentation for readability.
8. There should not be any comma after last element in the array.

Example input question:
What are the main causes of deforestation, and how can it be mitigated?

Example output:
{
  "questions": [
    "What are the primary human activities that contribute to deforestation?",
    "How does agriculture play a role in deforestation?",
    "What is the impact of logging and timber harvesting on deforestation?",
    "How do urbanization and infrastructure development contribute to deforestation?",
    "What are the environmental consequences of deforestation?",
    "What are some effective strategies for reducing deforestation?",
    "How can reforestation and afforestation help mitigate the effects of deforestation?",
    "What role can governments and policies play in preventing deforestation?",
    "How can individuals and communities contribute to reducing deforestation?"
  ]
}
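
Once you have the subqueries, one approach is to run your chosen search for each of them and combine the grounding data. The following sketch reuses the search pattern and variables from the earlier samples, assumes the decomposition prompt's "questions" array is available as sub_questions, and deduplicates on the title field.

grounding_chunks = {}

for sub_question in sub_questions:
    sub_embedding = embedding_model.generate_embedding(
        chunk=str(pre_process.preprocess(sub_question))
    )
    sub_vector = RawVectorQuery(
        k=retrieve_num_of_documents,
        fields="contentVector",
        vector=sub_embedding,
    )
    sub_results = client.search(
        search_text=sub_question,
        vector_queries=[sub_vector],
        top=retrieve_num_of_documents,
        select=["title", "content", "summary"],
    )
    for doc in sub_results:
        grounding_chunks[doc["title"]] = doc  # Deduplicate chunks retrieved by multiple subqueries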

Passing images in queries

Some multimodal models, such as GPT-4V and GPT-4o, can interpret images. If you're using these models, you can choose to avoid chunking your images and instead pass the image as part of the prompt to the multimodal model. You should experiment to determine how this approach performs compared to chunking the images, with and without passing additional context. You should also compare the difference in cost between the approaches and do a cost-benefit analysis.
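
The following is a minimal sketch of passing an image alongside retrieved text context in a chat request. It assumes the openai package, the llm_client from the earlier sketch, a base64-encoded image file, and placeholder deployment and variable names (grounding_text holds your retrieved chunks).

import base64

with open("diagram.png", "rb") as f:  # Placeholder image file
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = llm_client.chat.completions.create(
    model="<gpt-4o-deployment-name>",  # Placeholder multimodal deployment
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": f"{query}\n\nGrounding context:\n{grounding_text}"},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }
    ],
)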

Filtering

Fields in the search store that are configured as filterable can be used to filter queries. Consider filtering on keywords and entities for queries that use those fields to help narrow down the results. Filtering lets you retrieve only the data that satisfies certain conditions from an index by eliminating irrelevant data, which improves the overall performance of the query and returns more relevant results. As with every decision, it's important to experiment and test. Queries might not have keywords, or they might have wrong keywords, abbreviations, or acronyms. You need to take these cases into consideration.
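
For example, in Azure AI Search you pass an OData filter expression with the query. The following sketch reuses the client, query, and vector variables from the earlier samples and assumes a filterable Collection(string) field named keywords and a keyword extracted from the query.

results = client.search(
    search_text=query,
    vector_queries=[vector],
    # Hypothetical filterable keywords collection; only chunks tagged with the extracted keyword are considered.
    filter="keywords/any(k: k eq 'electric vehicle')",
    top=retrieve_num_of_documents,
    select=["title", "content", "summary"],
)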

Reranking

Reranking allows you to run one or more queries, aggregate the results, and rank those results. Consider the following reasons to rerank your search results:

  • You performed manual multiple searches and you want to aggregate the results and rank them.
  • Vector and keyword searches aren't always accurate. You can increase the count of documents returned from your search, potentially including some valid results that would otherwise be ignored, and use reranking to evaluate the results.

You can use a large language model or cross-encoder to perform reranking. Some platforms, like Azure AI Search, have proprietary methods to rerank results. You can evaluate these options for your data to determine what works best for your scenario. The following sections provide details on these methods.

Large language model reranking

The following is a sample large language model prompt from the RAG Experiment Accelerator that reranks results.

A list of documents is shown below. Each document has a number next to it along with a summary of the document. A question is also provided.
Respond with the numbers of the documents you should consult to answer the question, in order of relevance, as well as the relevance score as json string based on json format as shown in the schema section. The relevance score is a number from 1–10 based on how relevant you think the document is to the question. The relevance score can be repetitive. Don't output any additional text or explanation or metadata apart from json string. Just output the json string and strip rest every other text. Strictly remove any last comma from the nested json elements if it's present.
Don't include any documents that are not relevant to the question. There should exactly be one documents element.
Example format:
Document 1:
content of document 1
Document 2:
content of document 2
Document 3:
content of document 3
Document 4:
content of document 4
Document 5:
content of document 5
Document 6:
content of document 6
Question: user defined question

schema:
{
    "documents": {
        "document_1": "Relevance",
        "document_2": "Relevance"
    }
}
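
The following is a minimal sketch of applying that prompt: number the candidate documents, send them with the question to the chat model, and reorder the documents by the returned relevance scores. The candidate_docs, rerank_prompt, and llm_client names are placeholders for your retrieved documents, the prompt above, and the chat client from the earlier sketch.

import json

numbered_docs = "\n".join(
    f"Document {i + 1}:\n{doc['summary']}" for i, doc in enumerate(candidate_docs)
)

response = llm_client.chat.completions.create(
    model="<chat-deployment-name>",  # Placeholder deployment name
    messages=[
        {"role": "system", "content": rerank_prompt},
        {"role": "user", "content": f"{numbered_docs}\nQuestion: {query}"},
    ],
    temperature=0,
)

# Keys look like "document_3"; map them back to the original documents in relevance order.
ranking = json.loads(response.choices[0].message.content)["documents"]
reranked = [candidate_docs[int(key.split("_")[1]) - 1] for key in ranking]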

Cross-encoder reranking

The following example from the RAG Experiment Accelerator GitHub repository uses the CrossEncoder class from the sentence-transformers library to load a Roberta model from Hugging Face. It then scores each chunk against the query by using the model, sorts the results, and returns the top N.

from sentence_transformers import CrossEncoder
...

model_name = 'cross-encoder/stsb-roberta-base'
model = CrossEncoder(model_name)

cross_scores_ques = model.predict(
    [[user_prompt, item] for item in documents],
    apply_softmax=True,
    convert_to_numpy=True,
)

top_indices_ques = cross_scores_ques.argsort()[-k:][::-1]
sub_context = []
for idx in list(top_indices_ques):
    sub_context.append(documents[idx])

Semantic ranking

Azure AI Search has a proprietary feature called semantic ranking. This feature uses deep learning models adapted from Microsoft Bing to promote the most semantically relevant results. To learn more, see How semantic ranker works.
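
The following is a minimal sketch of invoking the semantic ranker from the Python SDK. It reuses the client, query, and vector variables from the earlier samples and assumes a semantic configuration named my-semantic-config is already defined on the index.

results = client.search(
    search_text=query,
    vector_queries=[vector],
    query_type="semantic",
    semantic_configuration_name="my-semantic-config",  # Placeholder configuration name
    top=retrieve_num_of_documents,
    select=["title", "content", "summary"],
)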

Search guidance

Consider the following general guidance when implementing your search solution:

  • Title, summary, source, and the raw content (not cleaned) are good fields to return from a search.
  • Determine up front whether a query needs to be broken down into subqueries.
  • In general, it's a good practice to run queries on multiple fields, with both vector and text queries. When you receive a query, you don't know whether a vector search or a text search is better, and you don't know which fields are best to search against. You can search on multiple fields, potentially with multiple queries, rerank the results, and return the results with the highest scores.
  • Keywords and entities fields are good candidates to consider filtering on.
  • It's a good practice to use keywords along with vector searches. The keywords filter the results to a smaller subset. The vector search then works against that subset to find the best matches.

Search evaluation

In the preparation phase, you should have gathered test queries along with test document information. You can use the following information you gathered in that phase to evaluate your search results:

  • The query - The sample query
  • The context - The collection of all the text in the test documents that address the sample query

The following are three well-established retrieval evaluation methods that you can use to evaluate your search solution:

  • Precision at K - The percentage of correctly identified relevant items out of the total search results. This metric focuses on the accuracy of your search results.
  • Recall at K - Recall at K measures the percentage of relevant items in the top K out of the total number of relevant items. This metric focuses on search result coverage.
  • Mean Reciprocal Rank (MRR) - MRR measures the average of the reciprocal ranks of the first relevant answer in your ranked search results. This metric focuses on where the first relevant result occurs in the search results.

You should test both positive and negative examples. For the positive examples, you want the metrics to be as close to 1 as possible. For the negative examples, where your data shouldn't be able to address the queries, you want the metrics to be as close to 0 as possible. You should test all your test queries and average the positive query results and the negative query results to understand how your search results are performing in aggregate.
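
The following is a minimal sketch of computing these three metrics for a single test query, assuming you can label each retrieved chunk as relevant or not based on your test documents. The document IDs are illustrative.

def precision_at_k(retrieved, relevant, k):
    # Fraction of the top-k results that are relevant.
    top_k = retrieved[:k]
    return sum(1 for doc in top_k if doc in relevant) / k

def recall_at_k(retrieved, relevant, k):
    # Fraction of all relevant items that appear in the top-k results.
    top_k = retrieved[:k]
    return sum(1 for doc in top_k if doc in relevant) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    # 1 / rank of the first relevant result; 0 if no relevant result is returned.
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

# Illustrative document IDs; average these values across all test queries (MRR is the mean of the reciprocal ranks).
retrieved = ["doc4", "doc1", "doc7", "doc2"]
relevant = {"doc1", "doc2", "doc3"}
print(precision_at_k(retrieved, relevant, 4))  # 0.5
print(recall_at_k(retrieved, relevant, 4))     # 0.67
print(reciprocal_rank(retrieved, relevant))    # 0.5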

Next steps