How to create index using custom chunking within the enrichment pipeline in Azure AI Search

Filiz Camuz 0 Reputation points
2024-12-18T16:58:02.8466667+00:00

When using the built-in SplitSkill in an Azure indexer pipeline, Azure AI Search automatically provides a chunk_id field for each chunk, allowing the chunks to be indexed individually. However, when this step is replaced with a custom Web API skill that returns multiple chunks, the lack of a chunk_id at the root of each chunk prevents the indexer from creating separate documents in the index. Although the pipeline runs without errors, no documents appear in the portal because the indexer can't form properly keyed documents from the provided JSON structure.

How can I introduce chunk_id to the pipeline without getting the wrong output type error?

Azure AI Search
An Azure search service with built-in artificial intelligence capabilities that enrich information to help identify and explore relevant content at scale.

1 answer

  1. Laxman Reddy Revuri 1,130 Reputation points Microsoft Vendor
    2024-12-20T04:35:59.5733333+00:00

    Hi @Filiz Camuz
    Thanks for the question and for using the Microsoft Q&A platform.
    1. Ensure that your custom Web API skill returns each chunk with a unique chunk_id. You can do this by modifying the JSON structure of the response.
    Here’s an example of how your API should format the output:

    {
     "values": [
       {
         "chunk_id": "1",
         "content": "This is the first chunk of text."
       },
       {
         "chunk_id": "2",
         "content": "This is the second chunk of text."
       }
     ]
    }
    

    2. Make sure the JSON has the structure Azure expects: each chunk should be an object under the values array, and each chunk_id should be a unique string or number.
    3. If you get an output type error, check that the data types of chunk_id and the other fields match those expected by Azure, and that the response is valid JSON with proper syntax (no trailing commas or comments).
    4. Before integrating it back into the Azure pipeline, test your API independently: use a tool such as Postman or curl to send requests to your API and verify that it returns the expected JSON structure with chunk_id.
    5. After confirming that your API returns the correct format, update your Azure indexer configuration if needed so that the indexer handles and processes the chunk_id field correctly.
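
The steps above can be sketched in Python as a minimal, unofficial example. As far as I know, the custom Web API skill contract expects each element of values to carry a recordId and a data object (plus optional errors and warnings), so this sketch nests the chunks under data rather than placing them directly in values. The input names doc_id and text, the output name chunks, and the helper functions are all hypothetical, chosen only for illustration. The recordId is only unique within one request, so the chunk_id is built from an assumed document key instead.

```python
import json


def split_into_chunks(text, max_chars=200):
    """Naive fixed-width splitter; a stand-in for your own chunking logic."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


def build_skill_response(request_body, max_chars=200):
    """Build a response for a request shaped like
    {"values": [{"recordId": "...", "data": {"doc_id": "...", "text": "..."}}]}.
    "doc_id" and "text" are assumed input names, not fixed by Azure."""
    results = []
    for record in request_body.get("values", []):
        data = record.get("data", {})
        doc_id = str(data.get("doc_id", record["recordId"]))
        chunks = [
            # chunk_id is always a string and unique per (document, position)
            {"chunk_id": f"{doc_id}_{n}", "content": piece}
            for n, piece in enumerate(split_into_chunks(data.get("text", ""), max_chars))
        ]
        results.append({
            "recordId": record["recordId"],  # echoed back unchanged
            "data": {"chunks": chunks},      # "chunks" is an assumed output name
            "errors": [],
            "warnings": [],
        })
    return {"values": results}


def validate_response(response):
    """Return a list of problems; an empty list means the basic checks passed."""
    problems = []
    seen = set()
    for record in response.get("values", []):
        if "recordId" not in record:
            problems.append("record without recordId")
        for chunk in record.get("data", {}).get("chunks", []):
            chunk_id = chunk.get("chunk_id")
            if not isinstance(chunk_id, str):
                problems.append(f"chunk_id is not a string: {chunk_id!r}")
            elif chunk_id in seen:
                problems.append(f"duplicate chunk_id: {chunk_id}")
            else:
                seen.add(chunk_id)
    json.dumps(response)  # raises TypeError if the payload is not JSON-serializable
    return problems


if __name__ == "__main__":
    request = {"values": [{"recordId": "0", "data": {"doc_id": "docA", "text": "abcdefghij"}}]}
    response = build_skill_response(request, max_chars=4)
    print(json.dumps(response, indent=2))
    print(validate_response(response))
```

Running the script prints the payload you would compare against what Postman or curl returns from your deployed skill. One design note, offered tentatively: if you use index projections in the skillset to create one search document per chunk, my understanding is that the key of each chunk document is generated by the service, so a returned chunk_id may only need to be mapped to an ordinary string field rather than the index key. Please check this against the current documentation before relying on it.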
    references:
    https://learn.microsoft.com/en-us/azure/search/tutorial-rag-build-solution-pipeline

    https://learn.microsoft.com/en-us/azure/search/search-how-to-semantic-chunking

    I hope this information is helpful.
