SplitSkill in Azure Cognitive Search retrieve chunk_id

Andrea Quarta 30 Reputation points

When using the SplitSkill in Azure Cognitive Search, I need to know how to retrieve the unique chunk ID for each split section of the document. Since the skill divides the text into chunks (pages), I want to understand where the chunk ID is stored and how I can access it. Is the chunk ID available as metadata, or do I need to explicitly map it in the index schema?

Link to the skill documentation: https://learn.microsoft.com/en-us/azure/search/cognitive-search-skill-textsplit

{
    "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
    "textSplitMode" : "pages", 
    "maximumPageLength": 1000,
    "pageOverlapLength": 100,
    "maximumPagesToTake": 1,
    "defaultLanguageCode": "en",
    "inputs": [
        {
            "name": "text",
            "source": "/document/content"
        },
        {
            "name": "languageCode",
            "source": "/document/language"
        }
    ],
    "outputs": [
        {
            "name": "textItems",
            "targetName": "mypages"
        }
    ]
}
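
For intuition about what `textItems` contains, here is a deliberately simplified sketch of page splitting with overlap. It is not the service's algorithm (the real skill also takes sentence boundaries and language into account). `split_pages` and its parameters are hypothetical names that mirror the configuration above, and treating `maximum_pages_to_take=0` as "take all pages" is an assumption. The point is that the output is a plain list of strings, so at this stage the only identifier a chunk has is its position in the list.

```python
def split_pages(text, maximum_page_length, page_overlap_length, maximum_pages_to_take=0):
    """Simplified fixed-width splitter: character windows that overlap.

    Illustration only; the real SplitSkill is not limited to raw character offsets.
    """
    if page_overlap_length >= maximum_page_length:
        raise ValueError("overlap must be smaller than the page length")

    step = maximum_page_length - page_overlap_length
    pages = []
    start = 0
    while start < len(text):
        pages.append(text[start:start + maximum_page_length])
        # Assumption: 0 means "no limit" on the number of pages returned.
        if maximum_pages_to_take and len(pages) >= maximum_pages_to_take:
            break
        # Stop once the current window has reached the end of the text.
        if start + maximum_page_length >= len(text):
            break
        start += step
    return pages


pages = split_pages("abcdefghij", maximum_page_length=4, page_overlap_length=1)
for position, page in enumerate(pages):
    # The list position is the only "ID" available here.
    print(position, page)
```

In this sketch, a limit of 1 (as in the configuration above) would return only the first page.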

Is the chunk ID available as metadata? How can I map it?

1 answer

Shree Hima Bindu Maganti 2,895 Reputation points Microsoft Vendor

2025-02-04T17:57:16.47+00:00

Hi @Andrea Quarta,
Thanks for the question and for using the Microsoft Q&A platform.
The SplitSkill in Azure Cognitive Search does not automatically generate a unique chunk ID for each section of a document. To retrieve and store a chunk ID, you need to explicitly map it in your index schema by defining an additional output field in your skillset that captures the chunk ID as metadata.

To achieve this, modify the outputs section of your SplitSkill configuration to include a field for the chunk ID. This field can then be mapped to your index schema, enabling you to access it later when querying the indexed data.
Chunk large documents for vector search solutions in Azure AI Search

Chunk and vectorize by document layout or structure
Let me know if you need any further assistance. If the answer is helpful, please click Accept Answer and upvote it so that other people who face a similar issue can benefit from it.
Andrea Quarta 30 Reputation points

2025-02-05T11:30:48.5466667+00:00

Thank you for your response. I understand that the SplitSkill does not automatically generate a unique chunk ID, but I am struggling to retrieve the chunk_id that I have in the search index.

Currently, the text splitter divides the document into pages, but I need to extract the chunk_id in order to associate each extracted chunk with the correct page number. For example, my indexed document in the search portal contains:

"chunk_id": "19fb4f9bcb31_aHR0cHM6Ly9zYW9wZW5haS5ibG9iLmNvcmUud2luZG93cy5uZXQvZHJpdi1yYWctdXJsd2VicGRmLWRldi9FTUVBX1NEU09fV0FfOTYyNDM5X0lUX0lULnBkZg2_pages_6", "chunk": "example chunk"
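
The `chunk_id` value above ends in `_pages_6`. Assuming that the trailing number is the position of the chunk within the `/document/pages` array (an inference from the value's shape, not something confirmed in this thread), it can be read back from the ID on the client side. `chunk_ordinal` is a hypothetical helper name.

```python
def chunk_ordinal(chunk_id: str) -> int:
    """Return the trailing integer of an ID shaped like '<prefix>_pages_<n>'."""
    _prefix, separator, tail = chunk_id.rpartition("_")
    if not separator or not (tail.isascii() and tail.isdigit()):
        raise ValueError(f"no trailing ordinal in chunk_id: {chunk_id!r}")
    return int(tail)


example_id = (
    "19fb4f9bcb31_aHR0cHM6Ly9zYW9wZW5haS5ibG9iLmNvcmUud2luZG93cy5uZXQv"
    "ZHJpdi1yYWctdXJsd2VicGRmLWRldi9FTUVBX1NEU09fV0FfOTYyNDM5X0lUX0lULnBkZg2_pages_6"
)
print(chunk_ordinal(example_id))  # prints 6
```

Note that this number identifies a chunk produced by the splitter. It is not necessarily the page number of the original PDF, because chunk boundaries depend on the configured length and overlap.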

These are the skills I am currently using:

field_skill = ConditionalSkill(
    description="Skill to extract fields from a document",
    context="/document",
    inputs=[
        InputFieldMappingEntry(name="condition", source="= true"),
        InputFieldMappingEntry(name="whenFalse", source="= null")
    ],
    outputs=[
        OutputFieldMappingEntry(name="output", target_name="field_type"),
    ],
)

split_skill = SplitSkill(
    description="Split skill to chunk documents",
    text_split_mode="pages",
    context="/document",
    maximum_page_length=2000,
    page_overlap_length=500,
    inputs=[
        InputFieldMappingEntry(name="text", source="/document/content"),
    ],
    outputs=[
        OutputFieldMappingEntry(name="textItems", target_name="pages")
    ],
)

# page_skill = WebApiSkill(
#     description="Skill to get security groups",
#     context="/document",
#     http_method="POST",
#     batch_size=100,
#     uri=function_app_url,
#     inputs=[
#         InputFieldMappingEntry(name="chunk_id", source="/document/pages/*/chunk_id"),
#     ],
#     outputs=[
#         OutputFieldMappingEntry(name="number_page", target_name="number_page")
#     ],
# )

# Adding a WebApiSkill or DocumentExtractionSkill to generate the chunk ID
page_skill = DocumentExtractionSkill(
    name="generateChunkIdSkill",
    description="Generates a unique chunk_id for each chunk",
    context="/document/pages/*",
    inputs=[
        InputFieldMappingEntry(name="file_data", source="/document/pages/*/text")
    ],
    outputs=[
        OutputFieldMappingEntry(name="recordId", target_name="recordId"),
        OutputFieldMappingEntry(name="chunkIndex", target_name="chunkIndex")
    ],
)

I have tried using a custom skill via an Azure Function, but I am unsure about the correct source for chunk_id. Should I use source="/document/pages/*/chunk_id"? I also attempted using DocumentExtractionSkill, but I may be making a mistake in the implementation.

Could you please clarify how I can correctly extract the chunk_id?

Shree Hima Bindu Maganti 2,895 Reputation points Microsoft Vendor

2025-02-07T17:23:50.5+00:00

Hi @Andrea Quarta,
Apologies for the late response.
To configure chunk_id correctly in Azure Cognitive Search, make sure your skillset captures the chunk_id during the document processing stage. Because SplitSkill does not assign a chunk_id itself, a custom skill is required to assign this ID.

At this stage, you can use the DocumentExtractionSkill to assign an identifiable chunk_id to each extracted chunk.

Ensure your DocumentExtractionSkill has an output field for the chunk_id to capture the generated ID.

When creating page_skill, do not use source="/document/pages/*/chunk_id" unless an earlier skill already produces that field. Instead, define an additional output field named chunk_id in the DocumentExtractionSkill.

Then map this output to a field in the index schema so that the value can be retrieved when you query the indexed data.
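
One way to realize the custom-skill approach mentioned above is a Web API skill that runs at the `/document` context, receives the whole `pages` array, and returns one ID per chunk. The sketch below covers only the request and response handling in plain Python, following the custom Web API skill contract (`values`, `recordId`, `data`, `errors`, `warnings`). The input names `parent_id` and `pages`, the output name `chunks`, the function name `run_skill`, and the ID format are all assumptions chosen for illustration.

```python
import json


def run_skill(request_body: dict) -> dict:
    """Assign a positional chunk_id to every page of every record in the batch."""
    results = []
    for record in request_body.get("values", []):
        data = record.get("data", {})
        parent_id = data.get("parent_id")  # hypothetical input: a key-safe document ID
        pages = data.get("pages")          # hypothetical input: the SplitSkill output array

        if parent_id is None or pages is None:
            results.append({
                "recordId": record["recordId"],
                "data": {},
                "errors": [{"message": "Both 'parent_id' and 'pages' are required."}],
                "warnings": [],
            })
            continue

        # The position in the array is the only stable handle on a chunk,
        # so it is embedded in the generated ID.
        chunks = [
            {"chunk_id": f"{parent_id}_pages_{i}", "chunk_index": i, "text": page}
            for i, page in enumerate(pages)
        ]
        results.append({
            "recordId": record["recordId"],
            "data": {"chunks": chunks},
            "errors": [],
            "warnings": [],
        })
    return {"values": results}


request = {"values": [{"recordId": "0", "data": {"parent_id": "doc1", "pages": ["first", "second"]}}]}
print(json.dumps(run_skill(request), indent=2))
```

Running at the document level is a deliberate choice in this sketch: a skill invoked once per `/document/pages/*` item sees only one chunk's text and has no way to know that chunk's position.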

Finally, consider the following sample skillset configuration:

page_skill = DocumentExtractionSkill(
    name="generateChunkIdSkill",
    description="Generates a unique chunk_id for each chunk",
    context="/document/pages/*",
    inputs=[
        InputFieldMappingEntry(name="file_data", source="/document/pages/*/text")
    ],
    outputs=[
        OutputFieldMappingEntry(name="chunk_id", target_name="chunk_id"),  # Ensure this field is defined in your index
        OutputFieldMappingEntry(name="chunkIndex", target_name="chunkIndex")
    ],
)

Ensure the chunk_id field is in your index schema to store the generated IDs. This way, when you query the indexed documents, you can retrieve the chunk_id for each chunk.
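
For reference, the `chunk_id` format quoted earlier in this thread (ending in `_pages_6`) resembles keys that an indexer generates when a skillset uses index projections. That is an inference from the value's shape; the thread does not show the skillset's projection settings. Under that assumption, a minimal sketch of the relevant REST payload fragments, written as Python dictionaries, could look like this. The index name `my-chunk-index` and the field names `parent_id` and `chunk` are hypothetical.

```python
import json

# Fields of the chunk index. With index projections, the key field holds a
# value generated by the indexer and uses the keyword analyzer.
chunk_index_fields = [
    {"name": "chunk_id", "type": "Edm.String", "key": True,
     "searchable": True, "filterable": True, "analyzer": "keyword"},
    {"name": "parent_id", "type": "Edm.String", "filterable": True},
    {"name": "chunk", "type": "Edm.String", "searchable": True},
]

# Skillset fragment: one search document is projected per item under
# /document/pages/*, and parent_id links it back to the source document.
index_projections = {
    "selectors": [
        {
            "targetIndexName": "my-chunk-index",      # hypothetical index name
            "parentKeyFieldName": "parent_id",        # hypothetical field name
            "sourceContext": "/document/pages/*",
            "mappings": [
                {"name": "chunk", "source": "/document/pages/*"},
            ],
        }
    ],
    "parameters": {"projectionMode": "skipIndexingParentDocuments"},
}

print(json.dumps({"fields": chunk_index_fields}, indent=2))
print(json.dumps({"indexProjections": index_projections}, indent=2))
```

In this setup there is no mapping entry for `chunk_id`: the key is filled in by the indexer rather than read from the enrichment tree, which would explain why a path such as `/document/pages/*/chunk_id` is not available as a skill input.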
Chunk large documents for vector search solutions in Azure AI Search

Chunk and vectorize by document layout or structure
Let me know if you need any further assistance.