Dear Microsoft Team,
I am facing an issue with the Azure AI Search and Azure OpenAI integration in my project. I have a 120-page PDF document that I split into 50 groups using Python, and I uploaded the grouped data to Azure AI Search. The index and the indexer both report the expected count of 50. I built a chatbot on top of this using the text-embedding-ada-002 embedding model. The AI Search Explorer retrieves accurate results, and the chatbot answers approximately 90% of queries correctly. For certain queries, however, such as one about the "flip policy" on page 73 (grouped within pages 70–73), the chatbot fails to return the correct content. It seems the chatbot might not be searching the whole document for some queries. Could you help identify whether the issue originates in AI Search or in the chatbot, and suggest how to resolve it? The details are below.
We have created an Azure vector index with id ‘4358cd94-9515-422b-b642-3abdefd1ce10’. Along with the text, we store the embeddings. We chunk the text so that one page of the PDF becomes one Azure page. The source document contains around 120 pages, and its text is spread across multiple pages. So, before creating the Azure index, we group the pages, create the embeddings, and then store them in the index. After grouping, the total number of pages is 50.
When we run a hybrid search with the question ‘What are the requirements for appraisal on a flip transaction?’, the vector search does not return the document containing the text below, which is on page 74:
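A minimal sketch of what such a grouping step might look like, assuming that a page opening a new top-level section starts with a numbered heading (for example "12 Transaction Types"). The function name `group_pages`, the heading pattern, and the sample page texts are illustrative assumptions, not our exact code:

```python
import re

# Assumption: a page that opens a new top-level section begins with a line
# such as "12 Transaction Types" or "13. Flip Policy".
SECTION_START = re.compile(r"^\s*\d+\.?\s+[A-Z]")


def group_pages(pages):
    """Merge consecutive (page_number, text) pairs into section-level groups."""
    groups = []
    for number, text in pages:
        if not groups or SECTION_START.match(text):
            # This page starts a new section, so open a new group.
            groups.append({"first": number, "last": number, "parts": [text]})
        else:
            # Continuation page: append it to the current group.
            groups[-1]["last"] = number
            groups[-1]["parts"].append(text)
    return [
        {
            "title": f"Content from Pages {g['first']}-{g['last']}",
            "content": "\n".join(g["parts"]),
        }
        for g in groups
    ]


# Placeholder page texts, for illustration only.
pages = [
    (1, "1 Introduction\nfirst page of section one"),
    (2, "continued text of section one"),
    (3, "2 Second Section\nfirst page of section two"),
]
chunks = group_pages(pages)
```

With a scheme like this, a section that begins partway down a page ends up inside the previous group, which may be relevant to where the Flip Policy text lands.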
13. Flip Policy
Flip transactions must comply with the HPML appraisal rules in Regulation-Z (Reg-Z). The full Reg-Z revisions can be found at http://www.consumerfinance.gov/regulations/appraisals-for-higher-priced-mortgage-loans. A second appraisal is required in the following circumstances:
· Greater than 10% increase in sales price if the seller acquired the property in the past 90-days
· Greater than 20% increase in sales price if the seller acquired the property in the past 91-180 days
· These requirements do not apply if the seller is FNMA, FHLMC, HUD or any other government entity.
Instead, the hybrid search returns the following page contents:
'Title': 'Content from Pages 101-104', 'Score': 0.032786883413791656, 'Content': 'FOR INTERNAL USE ONLY ANGEL OAK INVESTOR CASH …
'Title': 'Content from Pages 14-26', 'Score': 0.0320020467042923, 'Content': '2 Appraisal and Property Requirements\n2.1 Appraisal Transfers\nAppraisal transfers …
'Title': 'Content from Pages 70-73', 'Score': 0.03021353855729103, 'Content': '12 Transaction Types\n12.1 Purchase\nThe lesser of the purchase price or appraised value of the subject property is used to calculate …
'Title': 'Content from Pages 37-49', 'Score': 0.01587301678955555, 'Content': "5 Credit and Liabilities\n5.1 General Information\nA U.S. credit report is required for each borrower on the loan …
'Title': 'Content from Pages 33-36', 'Score': 0.015384615398943424, 'Content': '4 Borrowers\n4.1 Borrowers – General\nThe USA Patriot Act requires banks and financial institutions to verify the name …
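The scores above look like Reciprocal Rank Fusion (RRF) values, which Azure AI Search uses to merge the keyword and vector rankings of a hybrid query. The sketch below assumes the documented RRF constant of 60 and shows how a fused score follows from a document's rank in each list; the ranks passed in are inferred from the scores, not read from the service:

```python
RRF_K = 60  # assumption: the constant Azure AI Search documents for RRF


def rrf_score(ranks, k=RRF_K):
    """Fused score for a document given its 1-based rank in each result list."""
    return sum(1.0 / (k + rank) for rank in ranks)


top = rrf_score([1, 1])      # rank 1 in both lists
second = rrf_score([2, 3])   # ranks 2 and 3
third = rrf_score([2, 11])   # rank 2 in one list, rank 11 in the other
single = rrf_score([3])      # present in only one of the two lists
```

Read this way, 'Content from Pages 101-104' would be ranked first by both the keyword and the vector query, while the last two results would each appear in only one of the two lists.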
Below is the code snippet used for the hybrid search:
import requests
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

# `headers`, `query`, `config`, and `index_name` are defined elsewhere.

# Embed the query text
response = requests.post(
    "https://api.openai.com/v1/embeddings",
    headers=headers,
    json={
        "input": query,
        "model": "text-embedding-ada-002"
    }
)
response.raise_for_status()
embedding = response.json()["data"][0]["embedding"]

# Vector query parameters
vector_query = VectorizedQuery(
    vector=embedding,
    k_nearest_neighbors=5,  # Number of nearest neighbors requested
    fields="contentVector"
)

# Create the search client
search_client = SearchClient(endpoint=config["endpoint"],
                             index_name=index_name,
                             credential=AzureKeyCredential(config["admin_key"]))

# Combine vector search with keyword search
results = search_client.search(
    search_text=query,  # Keyword search
    vector_queries=[vector_query],
    select=["title", "content", "category"],
    top=10,  # Return up to 10 results
    # Semantic configuration name; query_type is not set to "semantic" here
    semantic_configuration_name="my-semantic-config"
)

# Collect the fields of interest from each result
response = []
for result in results:
    item = {
        "Title": result['title'],
        "Score": result['@search.score'],
        "Content": result['content'],
        "Category": result['category']
    }
    response.append(item)

# Sort results by relevance score
response.sort(key=lambda x: x['Score'], reverse=True)
Best regards, Sudhakar.P