Azure AI Search Split Skill not splitting chunks as expected

Alexander Grimm 10 Reputation points
2025-01-15T09:37:38.4933333+00:00

I am running into problems with Azure AI Search and its skillset.

I get the following error:

Skill input 'text' was '12684' tokens, which is greater than the maximum allowed '8000' tokens. Consider chunking the text with the SplitSkill in order to be able to generate embeddings for it.

This error is clearly coming from the AzureOpenAIEmbeddingSkill, even though the debug messages don't show this explicitly.

I DO chunk the text before the embedding skill. contentSafe is just a field I introduced in an earlier skill, where I fix up a few things in the default content extraction.

I'm using API Version 2024-11-01-preview

This is a short version of my skillset:

{
  "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
  "name": "Split",
  "textSplitMode": "pages",
  "unit": "azureOpenAITokens",
  "maximumPageLength": 700,
  "pageOverlapLength": 100,
  "context": "/document",
  "defaultLanguageCode": "de",
  "maximumPagesToTake": 0,
  "inputs": [
    {
      "name": "text",
      "source": "/document/contentSafe"
    }
  ],
  "outputs": [
    {
      "name": "textItems",
      "targetName": "pages"
    }
  ]
},
{
  "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
  "name": "OpenAIEmbeddings",
  "description": "Convert chunk content to embeddings",
  "resourceUri": "{{ embeddingEndpoint }}",
  "deploymentId": "text-embedding-3-large",
  "context": "/document/pages/*",
  "modelName": "text-embedding-3-large",
  "dimensions": 1024,
  "inputs": [
    {
      "name": "text",
      "source": "/document/pages/*"
    }
  ],
  "outputs": [
    {
      "name": "embedding"
    }
  ]
},


1 answer

  1. Sina Salam 17,016 Reputation points
    2025-01-15T21:04:39.3033333+00:00

    Hello Alexander Grimm,

    Welcome to the Microsoft Q&A and thank you for posting your questions here.

    I understand that you are having an issue with the SplitSkill not chunking your text properly, causing the AzureOpenAIEmbeddingSkill to receive input that exceeds the token limit.

    Since you're using the 2024-11-01-preview version, make sure that all parameters and configurations are compatible with it; there may be changes in the preview version that affect the behavior of the skills. Also check that maximumPageLength is set correctly. For token-based chunking, the recommended length is typically around 512 tokens (see https://learn.microsoft.com/en-us/azure/search/cognitive-search-skill-textsplit). You might want to reduce maximumPageLength from 700 to a lower value, such as 512, to see if it helps. This is another answer you might find helpful: https://learn.microsoft.com/en-us/answers/questions/2127654/azure-ai-search-split-skillset-confifuration
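
    For example, here is a minimal sketch of your SplitSkill with the page length lowered to 512 tokens (all other values are kept from your configuration; adjust as needed):

    {
      "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
      "name": "Split",
      "textSplitMode": "pages",
      "unit": "azureOpenAITokens",
      "maximumPageLength": 512,
      "pageOverlapLength": 100,
      "context": "/document",
      "defaultLanguageCode": "de",
      "maximumPagesToTake": 0,
      "inputs": [ { "name": "text", "source": "/document/contentSafe" } ],
      "outputs": [ { "name": "textItems", "targetName": "pages" } ]
    }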

    Secondly, the pageOverlapLength is set to 100, which is generally fine. However, you might want to experiment with reducing this value to see if it impacts the chunking behavior.

    Thirdly, double-check the skillset configuration to ensure there are no issues with the input and output mappings. Make sure the SplitSkill is actually outputting the chunks to the pages field, and that the AzureOpenAIEmbeddingSkill is sourcing from that field rather than from the full document.
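
    As a minimal sketch of the wiring to verify (paths taken from your skillset; the embedding output's targetName "vector" is an assumption, use whatever name your index projections or mappings expect):

    {
      "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
      "context": "/document",
      "inputs": [ { "name": "text", "source": "/document/contentSafe" } ],
      "outputs": [ { "name": "textItems", "targetName": "pages" } ]
    },
    {
      "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
      "context": "/document/pages/*",
      "inputs": [ { "name": "text", "source": "/document/pages/*" } ],
      "outputs": [ { "name": "embedding", "targetName": "vector" } ]
    }

    If the embedding skill's context or its text input points at /document/contentSafe instead of /document/pages/*, the whole document is sent in a single call, which would produce exactly the token-limit error you are seeing.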

    In addition, sometimes the content itself has characteristics that cause issues with chunking. Ensure that the content in contentSafe looks as expected before it reaches the SplitSkill.
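
    One way to inspect it is to temporarily surface the enriched field in your index through the indexer's output field mappings (the target field name contentSafeDebug is hypothetical; it just needs to exist as a retrievable string field in your index):

    "outputFieldMappings": [
      {
        "sourceFieldName": "/document/contentSafe",
        "targetFieldName": "contentSafeDebug"
      }
    ]

    After a run, you can query the index and check that the field's length and content match what you expect.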

    I hope this is helpful! Do not hesitate to let me know if you have any other questions.


    Please don't forget to close the thread here by upvoting and accepting this as an answer if it is helpful.

