Hello Alexander Grimm,
Welcome to the Microsoft Q&A and thank you for posting your questions here.
I understand that you are having an issue with the `SplitSkill` not chunking your text properly, leading to the `AzureOpenAIEmbeddingSkill` receiving input that exceeds the token limit.
Since you're using the `2024-11-01-preview` API version, first make sure that all parameters and configurations are compatible with it; preview versions can introduce changes that affect the behavior of the skills. Also check that `maximumPageLength` is set correctly. For token-based chunking, the recommended length is typically around 512 tokens (see https://learn.microsoft.com/en-us/azure/search/cognitive-search-skill-textsplit). You might want to reduce `maximumPageLength` from 700 to a lower value, such as 512, to see if that helps. This related answer may also be useful: https://learn.microsoft.com/en-us/answers/questions/2127654/azure-ai-search-split-skillset-confifuration
Secondly, the `pageOverlapLength` is set to 100, which is generally fine. However, you might want to experiment with reducing this value to see if it impacts the chunking behavior.
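For reference, here is a minimal sketch of what the `SplitSkill` definition could look like with token-based chunking. The source field name `contentSafe` is taken from your description; the `unit` and `azureOpenAITokenizerParameters` settings are preview-only parameters, so please verify them against the `2024-11-01-preview` reference before using them:

```json
{
  "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
  "context": "/document",
  "textSplitMode": "pages",
  "maximumPageLength": 512,
  "pageOverlapLength": 50,
  "unit": "azureOpenAITokens",
  "azureOpenAITokenizerParameters": {
    "encoderModelName": "cl100k_base"
  },
  "inputs": [
    { "name": "text", "source": "/document/contentSafe" }
  ],
  "outputs": [
    { "name": "textItems", "targetName": "pages" }
  ]
}
```

With `unit` set to `azureOpenAITokens`, `maximumPageLength` is measured in tokens rather than characters, which is usually what you want when the downstream constraint is a token limit.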
Thirdly, double-check the skillset configuration to ensure there are no issues with the input and output mappings. Make sure the `SplitSkill` is correctly outputting the chunks to the `pages` field, and that the `AzureOpenAIEmbeddingSkill` is correctly sourcing from this field. In particular, the embedding skill's `context` should be `/document/pages/*` so that it runs once per chunk; if the context is `/document`, the skill can receive far more text than a single chunk, which would explain exceeding the token limit.
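As a rough illustration (the resource URI, deployment name, model name, and the `vector` target name below are placeholders, not values from your skillset), the embedding skill would look something like this:

```json
{
  "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
  "context": "/document/pages/*",
  "resourceUri": "https://<your-openai-resource>.openai.azure.com",
  "deploymentId": "<your-embedding-deployment>",
  "modelName": "text-embedding-ada-002",
  "inputs": [
    { "name": "text", "source": "/document/pages/*" }
  ],
  "outputs": [
    { "name": "embedding", "targetName": "vector" }
  ]
}
```

The key detail is that both `context` and the `text` input point at `/document/pages/*`, so the skill processes each chunk individually instead of the whole document at once.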
In addition, the content itself can sometimes have characteristics that cause issues with chunking. Ensure that the content in `contentSafe` is being processed as expected before it reaches the `SplitSkill`; a Debug Session in the Azure portal is a good way to inspect the enriched document at each step.
I hope this is helpful! Do not hesitate to let me know if you have any other questions.
Please don't forget to close the thread here by upvoting and accepting the answer if it was helpful.