Why does the default chunking algorithm used by Azure AI Search's "Import and Vectorize" feature lead to poor performance in RAG?

Bao, Jeremy (Cognizant)
2024-04-09T22:08:25.25+00:00

When you use the "Import and Vectorize" feature of Azure AI Search to ingest data from a container in a Storage Account and build an index on it, it creates and uses the following skillset:

{
  "@odata.context": ...,
  "@odata.etag": ...,
  "name": ...,
  "description": "Skillset to chunk documents and generate embeddings",
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
      "name": "#1",
      "description": null,
      "context": "/document/pages/*",
      "resourceUri": ....,
      "apiKey": "<redacted>",
      "deploymentId": ...,
      "inputs": [
        {
          "name": "text",
          "source": "/document/pages/*"
        }
      ],
      "outputs": [
        {
          "name": "embedding",
          "targetName": "vector"
        }
      ],
      "authIdentity": null
    },
    {
      "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
      "name": "#2",
      "description": "Split skill to chunk documents",
      "context": "/document",
      "defaultLanguageCode": "en",
      "textSplitMode": "pages",
      "maximumPageLength": 2000,
      "pageOverlapLength": 500,
      "maximumPagesToTake": 0,
      "inputs": [
        {
          "name": "text",
          "source": "/document/content"
        }
      ],
      "outputs": [
        {
          "name": "textItems",
          "targetName": "pages"
        }
      ]
    }
  ],
  "cognitiveServices": null,
  "knowledgeStore": null,
  "indexProjections": {
    "selectors": [
      {
        "targetIndexName": ...,
        "parentKeyFieldName": "parent_id",
        "sourceContext": "/document/pages/*",
        "mappings": [
          {
            "name": "chunk",
            "source": "/document/pages/*",
            "sourceContext": null,
            "inputs": []
          },
          {
            "name": "vector",
            "source": "/document/pages/*/vector",
            "sourceContext": null,
            "inputs": []
          },
          {
            "name": "title",
            "source": "/document/metadata_storage_name",
            "sourceContext": null,
            "inputs": []
          }
        ]
      }
    ],
    "parameters": {
      "projectionMode": "skipIndexingParentDocuments"
    }
  },
  "encryptionKey": null
}

Firstly, why is the embedding skill listed before the splitting skill? Shouldn't it be the other way around? If the embedding really ran first, wouldn't all chunks of each document end up with the same vector? That does not appear to be the case, though.

Secondly, why is performance so bad when passing raw data files into this? I am working on applications where we need a chatbot using RAG on structured data. When I use a script I made to split the files based on their structure before uploading the individual chunks as separate files, I can get decent results. When I simply shove the raw files into an index using this "Import and Vectorize" feature, I get terrible results, with the chatbot either saying that it does not know the answer or providing a wrong answer much more often than it produces something valid. This seems strange, as the 500-character overlap should prevent data from being lost across chunk boundaries in most cases.
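For reference, the pre-chunking script I use is roughly like the sketch below (simplified: the real structure detection depends on the file format, and the folder, container, and connection-string names are placeholders). It splits each file on its own section boundaries and uploads every chunk as a separate blob, so "Import and Vectorize" then indexes each section as its own document:

import os
from azure.storage.blob import BlobServiceClient

SOURCE_DIR = "raw_docs"        # local folder with the raw files (placeholder)
CONTAINER = "prechunked-docs"  # placeholder container name
CONN_STR = os.environ["AZURE_STORAGE_CONNECTION_STRING"]

def split_on_structure(text: str) -> list[str]:
    # Naive example policy: start a new chunk at every line that looks like a
    # heading. The real logic depends on how the files are structured.
    chunks, current = [], []
    for line in text.splitlines():
        if line.startswith("#") and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks

container = BlobServiceClient.from_connection_string(CONN_STR).get_container_client(CONTAINER)

for filename in os.listdir(SOURCE_DIR):
    with open(os.path.join(SOURCE_DIR, filename), encoding="utf-8") as f:
        raw = f.read()
    for i, chunk in enumerate(split_on_structure(raw)):
        # Each chunk becomes its own blob, so the indexer treats it as a standalone document.
        container.upload_blob(name=f"{filename}.{i:03}.txt", data=chunk, overwrite=True)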


Accepted answer
VenkateshDodda-MSFT (Microsoft Employee)
2024-04-10T06:53:55.06+00:00

@Bao, Jeremy (Cognizant) Thanks for your patience on this. I have checked with the internal team and am sharing the details below.

The skills in a skillset are not executed in the order they are numbered; execution order is determined by their inputs and outputs. The embedding skill runs after the split skill because its input (/document/pages/*) is produced by the split skill's output. More on how skillset execution works: Skillset concepts - Azure AI Search | Microsoft Learn.
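To make the dependency concrete, here is a toy illustration in Python (not Azure code, just the idea): a skill becomes runnable only once every enrichment-tree path it reads has been produced, so the split skill runs first even though it is listed second.

# Each skill declares which enrichment-tree paths it reads and writes.
skills = {
    "#1 embedding": {"reads": {"/document/pages/*"}, "writes": {"/document/pages/*/vector"}},
    "#2 split":     {"reads": {"/document/content"}, "writes": {"/document/pages/*"}},
}

available = {"/document/content"}   # already present in the cracked source document
order = []
while len(order) < len(skills):
    for name, io in skills.items():
        if name not in order and io["reads"] <= available:
            order.append(name)
            available |= io["writes"]

print(order)   # ['#2 split', '#1 embedding'] -- the split skill runs first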

Regarding your second question, why performance is so bad when you pass the raw data files into this and get terrible results:

Based on the shared information, we understand that you get good results when you split the data based on its structure, so structure is important for your use case. The Split skill won't suffice here, since it performs fixed-size chunking. In that case, you should consider using a custom skill to split the data the way you need.
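For illustration, a custom Web API skill is just an HTTP endpoint that accepts the skillset's batch payload and returns one result per record. A minimal sketch as an Azure Function (Python v2 programming model; the splitting logic and the "chunks" output name are placeholders you would wire up in the skillset mappings) could look like this:

import json
import azure.functions as func

app = func.FunctionApp()

@app.route(route="chunk", auth_level=func.AuthLevel.FUNCTION)
def chunk(req: func.HttpRequest) -> func.HttpResponse:
    body = req.get_json()
    results = []
    for record in body.get("values", []):
        text = record["data"].get("text", "")
        # Placeholder policy: split on blank lines. Replace this with the
        # structure-aware splitting your documents need.
        chunks = [c.strip() for c in text.split("\n\n") if c.strip()]
        results.append({
            "recordId": record["recordId"],
            "data": {"chunks": chunks},
            "errors": None,
            "warnings": None,
        })
    # The skillset maps this "chunks" output much like the split skill's "pages".
    return func.HttpResponse(json.dumps({"values": results}),
                             mimetype="application/json")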

Currently, we don't have any native chunking skill that preserves the document structure. This is on our roadmap but won't be available for the next few months.

You could wrap something similar to the script you already have in a custom Web API skill to do this: Custom Web API skill in skillsets - Azure AI Search | Microsoft Learn, or

You could also consider Azure AI Document Intelligence, which preserves structure as well: Build a Document Intelligence custom skill for Azure AI Search - Training | Microsoft Learn
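As a rough sketch of that second option (using the azure-ai-formrecognizer SDK with the prebuilt-layout model; the endpoint, key, and file name are placeholders, and grouping by section headings is just one possible policy):

from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

# Placeholder endpoint and key for your Document Intelligence resource.
client = DocumentAnalysisClient("https://<your-resource>.cognitiveservices.azure.com/",
                                AzureKeyCredential("<key>"))

with open("report.pdf", "rb") as f:
    result = client.begin_analyze_document("prebuilt-layout", document=f).result()

# prebuilt-layout returns paragraphs tagged with roles (title, sectionHeading, ...),
# which can drive structure-aware chunking before indexing.
chunks, current = [], []
for para in result.paragraphs:
    if para.role == "sectionHeading" and current:
        chunks.append("\n".join(current))
        current = []
    current.append(para.content)
if current:
    chunks.append("\n".join(current))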

Hope this helps; let me know if you have any questions on this.

