Facing an Issue in Azure AI Search.

Sudhakar P 25 Reputation points
2025-01-01T11:27:11.1466667+00:00

Dear Microsoft Team,

I am facing an issue with AI Search and Azure OpenAI integration in my project. I have a 120-page PDF document that I split into 50 groups using Python and uploaded the grouped data to Azure AI Search. The index and indexer count are accurate at 50, and I built a chatbot using the Ada-embedding model via Azure OpenAI Service. While the AI Search Explorer retrieves accurate results, the chatbot provides correct answers for approximately 90% of queries. However, for certain cases, such as querying the "flip policy" on page 73 (grouped within pages 70–73), the chatbot fails to provide the correct output. It seems the chatbot might not be scanning the document completely for some queries. Could you help identify if the issue is originating from AI Search or the chatbot and suggest solutions to resolve it? I'm attaching the error details below.

We have created an Azure vector index with id ‘4358cd94-9515-422b-b642-3abdefd1ce10’, in which we store the embeddings along with the text. We start by chunking the text with one PDF page per chunk. The source document has around 120 pages, and related text is spread across multiple pages, so before creating the index we group the pages, create embeddings for each group, and then store the groups in the index. After grouping, the total number of grouped pages is 50.

When we run a hybrid search with the question ‘What are the requirements for appraisal on a flip transaction?’, the vector search does not return the document containing the text below, which is on page 74:

13. Flip Policy

Flip transactions must comply with the HPML appraisal rules in Regulation-Z (Reg-Z). The full Reg-Z revisions can be found at http://www.consumerfinance.gov/regulations/appraisals-for-higher-priced-mortgage-loans. A second appraisal is required in the following circumstances: · Greater than 10% increase in sales price if the seller acquired the property in the past 90-days · Greater than 20% increase in sales price if the seller acquired the property in the past 91-180 days · These requirements do not apply if the seller is FNMA, FHLMC, HUD or any other government entity.

The hybrid search returns the following page contents:

'Title': 'Content from Pages 101-104', 'Score': 0.032786883413791656, 'Content': 'FOR INTERNAL USE ONLY ANGEL OAK INVESTOR CASH …

'Title': 'Content from Pages 14-26', 'Score': 0.0320020467042923, 'Content': '2 Appraisal and Property Requirements\n2.1 Appraisal Transfers\nAppraisal transfers …

'Title': 'Content from Pages 70-73', 'Score': 0.03021353855729103, 'Content': '12 Transaction Types\n12.1 Purchase\nThe lesser of the purchase price or appraised value of the subject property is used to calculate …

'Title': 'Content from Pages 37-49', 'Score': 0.01587301678955555, 'Content': "5 Credit and Liabilities\n5.1 General Information\nA U.S. credit report is required for each borrower on the loan …

'Title': 'Content from Pages 33-36', 'Score': 0.015384615398943424, 'Content': '4 Borrowers\n4.1 Borrowers – General\nThe USA Patriot Act requires banks and financial institutions to verify the name …
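The scores listed above fall in the range produced by Reciprocal Rank Fusion (RRF), which Azure AI Search documents as its way of merging the keyword and vector rankings of a hybrid query. The following is a minimal sketch of that formula, assuming the documented constant k = 60 and 1-based ranks; `rrf_score` is a hypothetical helper written for illustration, not an SDK function.

```python
def rrf_score(ranks, k=60):
    """Sum 1 / (k + rank) over every ranked list in which a document appears.

    `ranks` holds the document's 1-based position in each list that returned it.
    k=60 is an assumption taken from the Azure AI Search documentation.
    """
    return sum(1.0 / (k + rank) for rank in ranks)


# A document ranked first by both the keyword query and the vector query
top_in_both = rrf_score([1, 1])

# A document returned by only one of the two queries, at position 3
single_list = rrf_score([3])

print(round(top_in_both, 4), round(single_list, 4))
```

Under these assumptions, a score near 0.0328 corresponds to rank 1 in both lists, while scores near 0.0159 and 0.0154 correspond to a single list only (ranks 3 and 5). If that reading is right, the last two hits were matched by only one of the two queries.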

Below is the code snippet used for the hybrid search:

import requests
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

# Embed the user query (headers, query, config and index_name are defined elsewhere)
response = requests.post(
    "https://api.openai.com/v1/embeddings",
    headers=headers,
    json={
        "input": query,
        "model": "text-embedding-ada-002"
    }
)
response.raise_for_status()
embedding = response.json()["data"][0]["embedding"]

# Adjust vector query parameters
vector_query = VectorizedQuery(
    vector=embedding,
    k_nearest_neighbors=5,  # Increased from 5 to 10
    fields="contentVector"
)

# Add hybrid search
search_client = SearchClient(endpoint=config["endpoint"],
                             index_name=index_name,
                             credential=AzureKeyCredential(config["admin_key"]))

# Combine vector search with keyword search
results = search_client.search(
    search_text=query,  # Add keyword search
    vector_queries=[vector_query],
    select=["title", "content", "category"],
    top=10,  # Increase number of results
    semantic_configuration_name="my-semantic-config"  # Use semantic search
)

# Collect the fields of interest from each hit
response = []
for result in results:
    item = {
        "Title": result['title'],
        "Score": result['@search.score'],
        "Content": result['content'],
        "Category": result['category']
    }
    response.append(item)

# Sort results by relevance score
response.sort(key=lambda x: x['Score'], reverse=True)
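One thing that may be worth checking in the snippet above: as far as I know, the Python SDK applies the semantic ranker only when `query_type="semantic"` is passed together with `semantic_configuration_name`. It then reports the reranked relevance in `@search.reranker_score`, while `@search.score` remains the fused hybrid score, so re-sorting by `@search.score` would undo any semantic ordering. The sketch below works on plain dictionaries shaped like the search results. `order_hits` and `ranks_containing` are hypothetical helpers, and the sample hits are made up for illustration.

```python
def order_hits(hits):
    """Sort hits by semantic reranker score when present, else by fused score.

    Hits that carry a reranker score are placed ahead of hits that do not.
    """
    def sort_key(hit):
        reranker = hit.get("@search.reranker_score")
        if reranker is not None:
            return (1, reranker)
        return (0, hit["@search.score"])

    return sorted(hits, key=sort_key, reverse=True)


def ranks_containing(hits, phrase):
    """Return the 1-based positions of hits whose content contains the phrase."""
    needle = phrase.lower()
    return [
        position
        for position, hit in enumerate(hits, start=1)
        if needle in hit.get("content", "").lower()
    ]


# Made-up hits for illustration only (not real search output)
sample_hits = [
    {"title": "A", "content": "12 Transaction Types ...", "@search.score": 0.03,
     "@search.reranker_score": 1.1},
    {"title": "B", "content": "13. Flip Policy ...", "@search.score": 0.02,
     "@search.reranker_score": 2.4},
]

ordered = order_hits(sample_hits)
print([hit["title"] for hit in ordered])
print(ranks_containing(ordered, "flip policy"))
```

Running `ranks_containing` against the full, untruncated `content` of each returned hit is one way to tell where the problem sits. If the "Flip Policy" text appears in a retrieved chunk, retrieval worked and the issue is more likely in how the chatbot uses the context; if it appears nowhere, the issue is in the search step.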

Best regards, Sudhakar.P


1 answer

  1. Azar 25,145 Reputation points MVP
    2025-01-01T16:43:09.25+00:00

    Hi there Sudhakar P

    Thanks for using the Q&A platform.

    It seems your current chunking strategy may cause the chatbot to miss context when related content spans multiple groups. Try revising the chunking process to include overlapping content between groups. Also, while text-embedding-ada is suitable for general-purpose embeddings, it might not capture domain-specific nuances, so try fine-tuning the embeddings or using a different embedding model. Preprocessing user queries to include related keywords can also sometimes improve semantic matching.
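The overlapping-chunk suggestion can be sketched as follows. This is a minimal illustration, assuming fixed-size groups of consecutive pages that share one page with their neighbor. The original post appears to group pages by section with varying sizes, so `group_pages`, `size`, and `overlap` are hypothetical names and values rather than the asker's actual grouping logic.

```python
def group_pages(pages, size=4, overlap=1):
    """Group consecutive pages into windows that share `overlap` pages."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")

    step = size - overlap  # how far each new window advances
    groups = []
    for start in range(0, len(pages), step):
        groups.append(pages[start:start + size])
        # Stop once a window has reached the last page
        if start + size >= len(pages):
            break
    return groups


# Page numbers stand in for page texts here
page_numbers = list(range(1, 11))
for group in group_pages(page_numbers, size=4, overlap=1):
    print(group)
```

With real data, each element of `pages` would be the text of one PDF page, and each group would be joined into a single string before embedding. The shared page means that content sitting at a group boundary, such as a section that starts on the last page of a group, is also present at the start of the next group.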

    If this helps, kindly accept the answer. Thanks much.

