Why does the default chunking algorithm used by Azure AI Search's "Import and Vectorize" feature lead to poor performance in RAG?

Bao, Jeremy (Cognizant)
2024-04-09T22:08:25.25+00:00

When you use the "Import and Vectorize" feature of Azure AI Search to ingest data from a container in a Storage Account and build an index on it, it creates and uses the following skillset:

{
  "@odata.context": ...,
  "@odata.etag": ...,
  "name": ...,
  "description": "Skillset to chunk documents and generate embeddings",
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
      "name": "#1",
      "description": null,
      "context": "/document/pages/*",
      "resourceUri": ....,
      "apiKey": "<redacted>",
      "deploymentId": ...,
      "inputs": [
        {
          "name": "text",
          "source": "/document/pages/*"
        }
      ],
      "outputs": [
        {
          "name": "embedding",
          "targetName": "vector"
        }
      ],
      "authIdentity": null
    },
    {
      "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
      "name": "#2",
      "description": "Split skill to chunk documents",
      "context": "/document",
      "defaultLanguageCode": "en",
      "textSplitMode": "pages",
      "maximumPageLength": 2000,
      "pageOverlapLength": 500,
      "maximumPagesToTake": 0,
      "inputs": [
        {
          "name": "text",
          "source": "/document/content"
        }
      ],
      "outputs": [
        {
          "name": "textItems",
          "targetName": "pages"
        }
      ]
    }
  ],
  "cognitiveServices": null,
  "knowledgeStore": null,
  "indexProjections": {
    "selectors": [
      {
        "targetIndexName": ...,
        "parentKeyFieldName": "parent_id",
        "sourceContext": "/document/pages/*",
        "mappings": [
          {
            "name": "chunk",
            "source": "/document/pages/*",
            "sourceContext": null,
            "inputs": []
          },
          {
            "name": "vector",
            "source": "/document/pages/*/vector",
            "sourceContext": null,
            "inputs": []
          },
          {
            "name": "title",
            "source": "/document/metadata_storage_name",
            "sourceContext": null,
            "inputs": []
          }
        ]
      }
    ],
    "parameters": {
      "projectionMode": "skipIndexingParentDocuments"
    }
  },
  "encryptionKey": null
}

Firstly, why is the embedding skill listed before the splitting skill? Shouldn't it be the other way around? If the embedding really ran first, wouldn't all chunks of each document end up with the same vector? That does not appear to be the case, though.

Secondly, why is performance so bad when passing raw data files into this? I am working on applications where we need a chatbot using RAG on structured data. When I use a script I made to split the files based on their structure before uploading the individual chunks as separate files, I can get decent results. When I simply shove the raw files into an index using this "Import and Vectorize" feature, I get terrible results, with the chatbot either saying that it does not know the answer or providing a wrong answer much more often than it produces something valid. This seems strange, as the 500-character overlap should prevent data from being lost across chunk boundaries in most cases.
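For reference, the pre-chunking script I use is roughly like the sketch below (simplified: the real structure detection depends on the file format, and the folder, container, and connection-string names are placeholders). It splits each file on its own section boundaries and uploads every chunk as a separate blob, so "Import and Vectorize" then indexes each section as its own document:

import os
from azure.storage.blob import BlobServiceClient

SOURCE_DIR = "raw_docs"        # local folder with the raw files (placeholder)
CONTAINER = "prechunked-docs"  # placeholder container name
CONN_STR = os.environ["AZURE_STORAGE_CONNECTION_STRING"]

def split_on_structure(text: str) -> list[str]:
    # Naive example policy: start a new chunk at every line that looks like a
    # heading. The real logic depends on how the files are structured.
    chunks, current = [], []
    for line in text.splitlines():
        if line.startswith("#") and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks

container = BlobServiceClient.from_connection_string(CONN_STR).get_container_client(CONTAINER)

for filename in os.listdir(SOURCE_DIR):
    with open(os.path.join(SOURCE_DIR, filename), encoding="utf-8") as f:
        raw = f.read()
    for i, chunk in enumerate(split_on_structure(raw)):
        # Each chunk becomes its own blob, so the indexer treats it as a standalone document.
        container.upload_blob(name=f"{filename}.{i:03}.txt", data=chunk, overwrite=True)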


Accepted answer
VenkateshDodda-MSFT (Microsoft Employee)
2024-04-10T06:53:55.06+00:00

@Bao, Jeremy (Cognizant) Thanks for your patience on this. I have checked with the internal team and am sharing the details below.

The skills in a skillset are not executed in the order they are numbered; execution order is determined by their inputs and outputs. The embedding skill runs after the split skill because its input (/document/pages/*) is produced by the split skill's output. More on how skillset execution works: Skillset concepts - Azure AI Search | Microsoft Learn.
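To make the dependency concrete, here is a toy illustration in Python (not Azure code, just the idea): a skill becomes runnable only once every enrichment-tree path it reads has been produced, so the split skill runs first even though it is listed second.

# Each skill declares which enrichment-tree paths it reads and writes.
skills = {
    "#1 embedding": {"reads": {"/document/pages/*"}, "writes": {"/document/pages/*/vector"}},
    "#2 split":     {"reads": {"/document/content"}, "writes": {"/document/pages/*"}},
}

available = {"/document/content"}   # already present in the cracked source document
order = []
while len(order) < len(skills):
    for name, io in skills.items():
        if name not in order and io["reads"] <= available:
            order.append(name)
            available |= io["writes"]

print(order)   # ['#2 split', '#1 embedding'] -- the split skill runs first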

Regarding your second question, why performance is so bad when you pass the raw data files into this and get terrible results:

Based on the shared information, we understand that you get good results when you split the data based on its structure, so structure is important for your use case. The Split skill won't suffice here, since it performs fixed-size chunking. In that case, you should consider using a custom skill to split the data the way you need.
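For illustration, a custom Web API skill is just an HTTP endpoint that accepts the skillset's batch payload and returns one result per record. A minimal sketch as an Azure Function (Python v2 programming model; the splitting logic and the "chunks" output name are placeholders you would wire up in the skillset mappings) could look like this:

import json
import azure.functions as func

app = func.FunctionApp()

@app.route(route="chunk", auth_level=func.AuthLevel.FUNCTION)
def chunk(req: func.HttpRequest) -> func.HttpResponse:
    body = req.get_json()
    results = []
    for record in body.get("values", []):
        text = record["data"].get("text", "")
        # Placeholder policy: split on blank lines. Replace this with the
        # structure-aware splitting your documents need.
        chunks = [c.strip() for c in text.split("\n\n") if c.strip()]
        results.append({
            "recordId": record["recordId"],
            "data": {"chunks": chunks},
            "errors": None,
            "warnings": None,
        })
    # The skillset maps this "chunks" output much like the split skill's "pages".
    return func.HttpResponse(json.dumps({"values": results}),
                             mimetype="application/json")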

Currently, we don't have any native chunking skill that preserves the document structure. This is on our roadmap but won't be available for the next few months.

You could wrap something similar to the script you already have in a custom Web API skill to do this: Custom Web API skill in skillsets - Azure AI Search | Microsoft Learn, or

You could also consider Azure AI Document Intelligence, which preserves structure as well: Build a Document Intelligence custom skill for Azure AI Search - Training | Microsoft Learn
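As a rough sketch of that second option (using the azure-ai-formrecognizer SDK with the prebuilt-layout model; the endpoint, key, and file name are placeholders, and grouping by section headings is just one possible policy):

from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

# Placeholder endpoint and key for your Document Intelligence resource.
client = DocumentAnalysisClient("https://<your-resource>.cognitiveservices.azure.com/",
                                AzureKeyCredential("<key>"))

with open("report.pdf", "rb") as f:
    result = client.begin_analyze_document("prebuilt-layout", document=f).result()

# prebuilt-layout returns paragraphs tagged with roles (title, sectionHeading, ...),
# which can drive structure-aware chunking before indexing.
chunks, current = [], []
for para in result.paragraphs:
    if para.role == "sectionHeading" and current:
        chunks.append("\n".join(current))
        current = []
    current.append(para.content)
if current:
    chunks.append("\n".join(current))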

Hope this helps; let me know if you have any questions on this.

