Azure AI Search Split Skill not splitting chunks as expected

Alexander Grimm 10 Reputation points
2025-01-15T09:37:38.4933333+00:00

I am running into problems with Azure AI Search and its skillset.

I get the following error:

Skill input 'text' was '12684' tokens, which is greater than the maximum allowed '8000' tokens. Consider chunking the text with the SplitSkill in order to be able to generate embeddings for it.

This error is clearly coming from the AzureOpenAIEmbeddingSkill, even though the debug messages don't show this explicitly.

I DO chunk the text before the embedding skill. contentSafe is just a field I introduced in an earlier skill, where I fix up a few things in the default content extraction.

I'm using API Version 2024-11-01-preview

This is a short version of my skillset:

{
  "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
  "name": "Split",
  "textSplitMode": "pages",
  "unit": "azureOpenAITokens",
  "maximumPageLength": 700,
  "pageOverlapLength": 100,
  "context": "/document",
  "defaultLanguageCode": "de",
  "maximumPagesToTake": 0,
  "inputs": [
    {
      "name": "text",
      "source": "/document/contentSafe"
    }
  ],
  "outputs": [
    {
      "name": "textItems",
      "targetName": "pages"
    }
  ]
},
{
  "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
  "name": "OpenAIEmbeddings",
  "description": "Convert chunk content to embeddings",
  "resourceUri": "{{ embeddingEndpoint }}",
  "deploymentId": "text-embedding-3-large",
  "context": "/document/pages/*",
  "modelName": "text-embedding-3-large",
  "dimensions": 1024,
  "inputs": [
    {
      "name": "text",
      "source": "/document/pages/*"
    }
  ],
  "outputs": [
    {
      "name": "embedding"
    }
  ]
},


1 answer

  1. Sina Salam 17,016 Reputation points
    2025-01-15T21:04:39.3033333+00:00

    Hello Alexander Grimm,

    Welcome to the Microsoft Q&A and thank you for posting your questions here.

    I understand that you are having an issue with the SplitSkill not chunking your text properly, causing the AzureOpenAIEmbeddingSkill to receive input that exceeds the token limit.

    Since you're using the 2024-11-01-preview version, make sure that all parameters and configurations are compatible with it; there may be changes in the preview version that affect the behavior of the skills. Also check that maximumPageLength is set correctly. For token-based chunking, the recommended length is typically around 512 tokens (see https://learn.microsoft.com/en-us/azure/search/cognitive-search-skill-textsplit). You might want to reduce maximumPageLength from 700 to a lower value, such as 512, to see if it helps. This is another answer you might find helpful: https://learn.microsoft.com/en-us/answers/questions/2127654/azure-ai-search-split-skillset-confifuration
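
    For example, here is a minimal sketch of your SplitSkill with the page length lowered to 512 tokens (all other values are kept from your configuration; adjust as needed):

    {
      "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
      "name": "Split",
      "textSplitMode": "pages",
      "unit": "azureOpenAITokens",
      "maximumPageLength": 512,
      "pageOverlapLength": 100,
      "context": "/document",
      "defaultLanguageCode": "de",
      "maximumPagesToTake": 0,
      "inputs": [ { "name": "text", "source": "/document/contentSafe" } ],
      "outputs": [ { "name": "textItems", "targetName": "pages" } ]
    }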

    Secondly, the pageOverlapLength is set to 100, which is generally fine. However, you might want to experiment with reducing this value to see if it impacts the chunking behavior.

    Thirdly, double-check the skillset configuration to ensure there are no issues with the input and output mappings. Make sure the SplitSkill is actually outputting the chunks to the pages field, and that the AzureOpenAIEmbeddingSkill is sourcing from that field rather than from the full document.
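
    As a minimal sketch of the wiring to verify (paths taken from your skillset; the embedding output's targetName "vector" is an assumption, use whatever name your index projections or mappings expect):

    {
      "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
      "context": "/document",
      "inputs": [ { "name": "text", "source": "/document/contentSafe" } ],
      "outputs": [ { "name": "textItems", "targetName": "pages" } ]
    },
    {
      "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
      "context": "/document/pages/*",
      "inputs": [ { "name": "text", "source": "/document/pages/*" } ],
      "outputs": [ { "name": "embedding", "targetName": "vector" } ]
    }

    If the embedding skill's context or its text input points at /document/contentSafe instead of /document/pages/*, the whole document is sent in a single call, which would produce exactly the token-limit error you are seeing.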

    In addition, sometimes the content itself has characteristics that cause issues with chunking. Ensure that the content in contentSafe looks as expected before it reaches the SplitSkill.
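
    One way to inspect it is to temporarily surface the enriched field in your index through the indexer's output field mappings (the target field name contentSafeDebug is hypothetical; it just needs to exist as a retrievable string field in your index):

    "outputFieldMappings": [
      {
        "sourceFieldName": "/document/contentSafe",
        "targetFieldName": "contentSafeDebug"
      }
    ]

    After a run, you can query the index and check that the field's length and content match what you expect.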

    I hope this is helpful! Do not hesitate to let me know if you have any other questions.


    Please don't forget to close the thread here by upvoting and accepting this as an answer if it is helpful.

