Indexing html & htm documents from Azure Blob Storage to Azure Search Index using an Indexer.

Iordanis Kokkinidis 0 Reputation points
2025-02-11T08:29:03.0666667+00:00

Hello everyone,

I am trying to set up an indexer that will index documents from a data source (Azure Blob Storage) into an Azure Search Index. I have also created a skillset that contains a chunking skill and an embedding skill. Specifically, these are "Microsoft.Skills.Text.SplitSkill" and "Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill".

What ends up happening is that the documents get indexed in the index I have set up, but neither the chunking nor the embedding happens. The documents are indexed as is, without getting chunked, and there are no embeddings in my index. I have also set the "outputFieldMappings" to map the output of the embedding skill to the "embedding" field in the index.

Still, no luck. I will provide the Indexer and Skillset .json below.

Indexer

{
    "@odata.context": "<redacted>",
    "@odata.etag": "<redacted>",
    "name": "<redacted>",
    "description": null,
    "dataSourceName": "azureblob-1737725881239-datasource",
    "skillsetName": "document-chunk-and-embedding-skillset",
    "targetIndexName": "<redacted>",
    "disabled": null,
    "schedule": null,
    "parameters": {
      "batchSize": null,
      "maxFailedItems": null,
      "maxFailedItemsPerBatch": null,
      "base64EncodeKeys": null,
      "configuration": {
        "dataToExtract": "contentAndMetadata",
        "parsingMode": "default"
      }
    },
    "fieldMappings": [],
    "outputFieldMappings": [
      {
        "sourceFieldName": "/document/myEmbedding",
        "targetFieldName": "embedding",
        "mappingFunction": null
      }
    ],
    "cache": null,
    "encryptionKey": null
  }

Skillset

{
    "@odata.etag": "<redacted>",
    "name": "document-chunk-and-embedding-skillset",
    "description": "",
    "skills": [
      {
        "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
        "name": "Document Chunk Splitter Skill",
        "description": "This skill is used to split documents into chunks",
        "context": "/document",
        "defaultLanguageCode": "en",
        "textSplitMode": "pages",
        "maximumPageLength": 512,
        "pageOverlapLength": 102,
        "maximumPagesToTake": 0,
        "unit": "azureOpenAITokens",
        "inputs": [
          {
            "name": "text",
            "source": "/document/content",
            "inputs": []
          }
        ],
        "outputs": [
          {
            "name": "textItems",
            "targetName": "chunks"
          }
        ],
        "azureOpenAITokenizerParameters": {
          "encoderModelName": "cl100k_base",
          "allowedSpecialTokens": []
        }
      },
      {
        "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
        "name": "Chunk Embedding Skill",
        "description": "This skill creates embeddings for each chunk created from the documents.",
        "context": "/document/chunks/*",
        "resourceUri": "<redacted>",
        "apiKey": "<redacted>",
        "deploymentId": "<redacted>",
        "dimensions": 1536,
        "modelName": "text-embedding-ada-002",
        "inputs": [
          {
            "name": "text",
            "source": "/document/chunks",
            "inputs": []
          }
        ],
        "outputs": [
          {
            "name": "embedding",
            "targetName": "myEmbedding"
          }
        ]
      }
    ],
    "cognitiveServices": {
      "@odata.type": "#Microsoft.Azure.Search.AIServicesByKey",
      "subdomainUrl": "<redacted>"
    }
  }

I would appreciate any help with this matter.

Thanks in advance!!!!


1 answer

  1. Sina Salam 17,571 Reputation points
    2025-02-11T12:59:24.7+00:00

    Hello Iordanis Kokkinidis,

    Welcome to the Microsoft Q&A and thank you for posting your questions here.

    I understand that you are trying to index HTML and HTM documents from Azure Blob Storage to an Azure Search Index using an indexer.

    To fix the issue, follow these steps:

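    First, correct the output field mapping. The embedding skill runs with the context "/document/chunks/*", so its output is written to "/document/chunks/*/myEmbedding", not "/document/myEmbedding". This path yields one vector per chunk, so a target field that holds a single vector may also require indexing each chunk as its own search document (for example with the skillset's "indexProjections"). Update the sourceFieldName in the indexer: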

    "outputFieldMappings": [
      {
        "sourceFieldName": "/document/chunks/*/myEmbedding",
        "targetFieldName": "embedding"
      }
    ]
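
    A skill output is written under the skill's context, so the path to use in sourceFieldName can be derived from the skillset itself. The following is a minimal Python sketch of that rule, not part of any Azure SDK; the helper name `skill_output_paths` is hypothetical, and the skillset is a trimmed copy of the one in the question.

```python
# Hypothetical helper (an illustration, not an Azure SDK function):
# list where each skill output lands in the enrichment tree.
def skill_output_paths(skillset: dict) -> list[str]:
    paths = []
    for skill in skillset.get("skills", []):
        # A skill without an explicit context runs at the document root.
        context = skill.get("context") or "/document"
        for output in skill.get("outputs", []):
            # targetName falls back to the output's own name when omitted.
            target = output.get("targetName") or output["name"]
            paths.append(f"{context}/{target}")
    return paths


# Trimmed copy of the skillset from the question.
skillset = {
    "skills": [
        {
            "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
            "context": "/document",
            "outputs": [{"name": "textItems", "targetName": "chunks"}],
        },
        {
            "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
            "context": "/document/chunks/*",
            "outputs": [{"name": "embedding", "targetName": "myEmbedding"}],
        },
    ]
}

for path in skill_output_paths(skillset):
    print(path)
```

    For this skillset the sketch prints "/document/chunks" and "/document/chunks/*/myEmbedding"; no output exists at "/document/myEmbedding".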
    

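    Secondly, make the embedding skill's input match its context. The context "/document/chunks/*" runs the skill once per chunk, so the text input must point at the individual chunk rather than at the whole array. Instead of:

    "source": "/document/chunks",

    Use:

    "source": "/document/chunks/*",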
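
    The skill's context and the source of its text input have to describe the same level of the enrichment tree. A minimal Python sketch of a check for this, assuming the skillset JSON has been loaded into a dict; the function name `find_array_inputs` is hypothetical and the check covers only this one kind of mismatch:

```python
# Hypothetical lint check (an illustration, not an Azure SDK function).
def find_array_inputs(skillset: dict) -> list[tuple[str, str]]:
    """Return (skill name, input name) pairs where a skill that runs once
    per array item (context ending in "/*") reads the whole array."""
    problems = []
    for skill in skillset.get("skills", []):
        context = skill.get("context") or "/document"
        if not context.endswith("/*"):
            continue  # the skill runs once per document, nothing to compare
        array_path = context[: -len("/*")]
        for item in skill.get("inputs", []):
            if item.get("source") == array_path:
                problems.append((skill.get("name", "<unnamed>"), item["name"]))
    return problems


# Embedding skill as posted in the question (trimmed).
skillset = {
    "skills": [
        {
            "name": "Chunk Embedding Skill",
            "context": "/document/chunks/*",
            "inputs": [{"name": "text", "source": "/document/chunks"}],
        }
    ]
}

print(find_array_inputs(skillset))
```

    The posted skillset is reported because its "text" input reads "/document/chunks" while the context iterates over "/document/chunks/*".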

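    Also, for proper content extraction, keep "dataToExtract" set to "contentAndMetadata". The accepted values are "storageMetadata", "allMetadata", and "contentAndMetadata"; "content" on its own is not one of them:

    "configuration": {
      "dataToExtract": "contentAndMetadata",
      "parsingMode": "default"
    }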
    

    Finally, you can enable diagnostic logging for the search service and run the following query against its resource logs:

    AzureDiagnostics
    | where Category == "IndexerExecution"
    | where Message contains "Error" or Message contains "Skill"

    This should show why the chunking or embedding is failing.
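
    Errors and warnings from the most recent run can also be read from the indexer's status endpoint. The following is a rough Python sketch under several assumptions: the API version "2024-07-01", the service and indexer names, and the sample payload are placeholders, and the field names ("lastResult", "errors", "warnings", "errorMessage", "message") follow my reading of the Get Indexer Status response and should be checked against the REST reference. Both helper names are hypothetical.

```python
from urllib.parse import quote


def indexer_status_url(service: str, indexer: str, api_version: str = "2024-07-01") -> str:
    """Build the status URL for an indexer (api_version is an assumption)."""
    return (
        f"https://{service}.search.windows.net/indexers/"
        f"{quote(indexer, safe='')}/status?api-version={api_version}"
    )


def summarize_last_run(status: dict) -> list[str]:
    """Flatten the last run of a status payload into readable lines."""
    last = status.get("lastResult") or {}
    lines = [
        f"status={last.get('status')} "
        f"processed={last.get('itemsProcessed')} "
        f"failed={last.get('itemsFailed')}"
    ]
    for err in last.get("errors") or []:
        lines.append(f"ERROR {err.get('name')}: {err.get('errorMessage')}")
    for warn in last.get("warnings") or []:
        lines.append(f"WARNING {warn.get('name')}: {warn.get('message')}")
    return lines


# Hypothetical payload for illustration only, not real service output.
sample_status = {
    "lastResult": {
        "status": "success",
        "itemsProcessed": 3,
        "itemsFailed": 0,
        "errors": [],
        "warnings": [
            {"key": "doc-1", "name": "example-skill", "message": "hypothetical warning text"}
        ],
    }
}

print(indexer_status_url("my-search-service", "my-indexer"))
for line in summarize_last_run(sample_status):
    print(line)
```

    To use it against a real service, send a GET request to the generated URL with the service's admin key in the "api-key" header and pass the parsed JSON to `summarize_last_run`.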

    I hope this is helpful! Do not hesitate to let me know if you have any other questions.


    Please don't forget to close the thread by upvoting this answer and accepting it if it was helpful.
