Indexing html & htm documents from Azure Blob Storage to Azure Search Index using an Indexer.

Question

Iordanis Kokkinidis 0

Hello everyone,

I am trying to set up an indexer that will index documents from a data source (Azure Blob Storage) to an Azure Search Index. I have also created a skillset that contains a chunking and an embedding skill. Specifically these are the "Microsoft.Skills.Text.SplitSkill" and the "Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill".

What ends up happening is that the documents get indexed in the index I have set up, but neither the chunking nor the embedding happens. The documents are indexed as is, without getting chunked, and there are no embeddings in my index. I have also set the "outputFieldMappings" to map the output of the embedding skill to the "embedding" field in the index.

Still, no luck. I will provide the Indexer and Skillset .json below.

Indexer

{
    "@odata.context": "<redacted>",
    "@odata.etag": "<redacted>",
    "name": "<redacted>",
    "description": null,
    "dataSourceName": "azureblob-1737725881239-datasource",
    "skillsetName": "document-chunk-and-embedding-skillset",
    "targetIndexName": "<redacted>",
    "disabled": null,
    "schedule": null,
    "parameters": {
      "batchSize": null,
      "maxFailedItems": null,
      "maxFailedItemsPerBatch": null,
      "base64EncodeKeys": null,
      "configuration": {
        "dataToExtract": "contentAndMetadata",
        "parsingMode": "default"
      }
    },
    "fieldMappings": [],
    "outputFieldMappings": [
      {
        "sourceFieldName": "/document/myEmbedding",
        "targetFieldName": "embedding",
        "mappingFunction": null
      }
    ],
    "cache": null,
    "encryptionKey": null
  }

Skillset

{
    "@odata.etag": "<redacted>",
    "name": "document-chunk-and-embedding-skillset",
    "description": "",
    "skills": [
      {
        "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
        "name": "Document Chunk Splitter Skill",
        "description": "This skill is used to split documents into chunks",
        "context": "/document",
        "defaultLanguageCode": "en",
        "textSplitMode": "pages",
        "maximumPageLength": 512,
        "pageOverlapLength": 102,
        "maximumPagesToTake": 0,
        "unit": "azureOpenAITokens",
        "inputs": [
          {
            "name": "text",
            "source": "/document/content",
            "inputs": []
          }
        ],
        "outputs": [
          {
            "name": "textItems",
            "targetName": "chunks"
          }
        ],
        "azureOpenAITokenizerParameters": {
          "encoderModelName": "cl100k_base",
          "allowedSpecialTokens": []
        }
      },
      {
        "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
        "name": "Chunk Embedding Skill",
        "description": "This skill creates embeddings for each chunk created from the documents.",
        "context": "/document/chunks/*",
        "resourceUri": "<redacted>",
        "apiKey": "<redacted>",
        "deploymentId": "<redacted>",
        "dimensions": 1536,
        "modelName": "text-embedding-ada-002",
        "inputs": [
          {
            "name": "text",
            "source": "/document/chunks",
            "inputs": []
          }
        ],
        "outputs": [
          {
            "name": "embedding",
            "targetName": "myEmbedding"
          }
        ]
      }
    ],
    "cognitiveServices": {
      "@odata.type": "#Microsoft.Azure.Search.AIServicesByKey",
      "subdomainUrl": "<redacted>"
    }
  }

I would appreciate any help with this matter.

Thanks in advance!!!!

Sampath 1,260 Reputation points Microsoft External Staff

2025-02-24T13:51:42.5933333+00:00

Hello @Iordanis Kokkinidis, just checking in to see whether the provided answer helped. If it answers your query, please click "Accept the answer", which may benefit other community members reading this thread. If you have any further queries, do let us know.

Sampath 1,260 Reputation points Microsoft External Staff

2025-02-26T07:35:35.62+00:00

Hello @Iordanis Kokkinidis, we still have not heard back from you. Was the answer provided below helpful? If it answers your query, please click Accept Answer and Yes, as it may benefit other community members reading this thread. If you have any further queries, do let us know.

Answer 1

Sina Salam 18,951

Hello Iordanis Kokkinidis,

Welcome to the Microsoft Q&A and thank you for posting your questions here.

I understand that you are trying to index HTML and HTM documents from Azure Blob Storage to an Azure Search Index using an indexer.

To fix the issue, follow these steps:

First, correct the output field mapping. The sourceFieldName must match the path where the skill actually writes its output. Update this in the indexer:

"outputFieldMappings": [
  {
    "sourceFieldName": "/document/chunks/*/myEmbedding",
    "targetFieldName": "embedding"
  }
]

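Secondly, check the context and input of the embedding skill. The context should stay as:

"context": "/document/chunks/*",

so that the skill runs once per chunk, but the text input must then point at a single chunk rather than at the whole array. Instead of:

"source": "/document/chunks",

Use:

"source": "/document/chunks/*",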

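Also, for proper content extraction, make sure "dataToExtract" is set to "contentAndMetadata" (the accepted values are "storageMetadata", "allMetadata", and "contentAndMetadata"), as it already is in your indexer:

"configuration": {
  "dataToExtract": "contentAndMetadata",
  "parsingMode": "default"
}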

Finally, enable diagnostic logging for the search service and run a query like the following against its resource logs (adjust the category and column names to what your workspace actually shows):

AzureDiagnostics
| where Category == "IndexerExecution"
| where Message contains "Error" or Message contains "Skill"

This should help show why the chunking or embedding is failing.

I hope this is helpful! Do not hesitate to let me know if you have any other questions.

Please don't forget to close the thread by upvoting and accepting this as an answer if it is helpful.

Iordanis Kokkinidis 0 Reputation points

2025-03-07T13:03:23.44+00:00

Hello @Sina Salam

Sorry, I got a bit busy since the last time I was working on this.

I did not manage to fix this.

I got the following errors for different documents:

Operation: Enrichment.AzureOpenAIEmbeddingSkill.Chunk Embedding Skill

Message: Could not execute skill because one or more skill input was invalid.

Details: Required skill input was not of the expected type 'String'. Name: 'text', Source: '$(/document/chunks)'.

Expression language parsing issues:

Operation: There's a mismatch in vector dimensions. The vector field 'embedding', with dimension of '1536', expects a length of '1536'. However, the provided vector has a length of '0'. Please ensure that the vector length matches the expected length of the vector field. Read the following documentation for more details: https://learn.microsoft.com/en-us/azure/search/vector-search-how-to-configure-compression-storage.

Message: Could not index document because some of the data in the document was not valid.

Details: There's a mismatch in vector dimensions. The vector field 'embedding', with dimension of '1536', expects a length of '1536'. However, the provided vector has a length of '0'. Please ensure that the vector length matches the expected length of the vector field. Read the following documentation for more details: https://learn.microsoft.com/en-us/azure/search/vector-search-how-to-configure-compression-storage.

Operations: Projection.SearchIndex.OutputFieldMapping.embedding

Message: Could not map output field 'embedding' to search index. Check the 'outputFieldMappings' property of your indexer.

Details: Missing or empty value '/document/chunks/2/myEmbedding'.

Do you think you can help me with this?

Sina Salam 18,951 Reputation points

2025-03-10T18:23:19.3133333+00:00

Hello Iordanis Kokkinidis,

Thank you for your feedback.

Please review your configuration. The errors can be summarized as follows:

Invalid Skill Input Type Error: Required skill input was not of the expected type 'String'. Name: 'text', Source: '$(/document/chunks)'.
Cause: The skill expects a String, but /document/chunks resolves to the whole array of chunks produced by the split skill.
Solution: Pass each chunk individually as text. The textItems output of SplitSkill is an array of strings, so each element of /document/chunks is already the chunk text and no sub-property is needed:
"inputs": [ { "name": "text", "source": "/document/chunks/*" } ]
If the chunks are not being created correctly, first make sure the split skill is actually producing output.
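
To see why the type check fails, it can help to picture the enriched document after the split skill has run. The sketch below is only an illustration of the shape, assuming the chunks come from the textItems output of SplitSkill as in the skillset above. The real enrichment tree is internal to the indexer, and the placeholder strings are hypothetical.

```json
{
  "document": {
    "content": "<full text extracted from the blob>",
    "chunks": [
      "<text of chunk 0>",
      "<text of chunk 1>",
      "<text of chunk 2>"
    ]
  }
}
```

Here /document/chunks addresses the whole array, while /document/chunks/* addresses each string in it, one at a time. With "context": "/document/chunks/*", the embedding skill runs once per element and writes myEmbedding alongside that element, which is why its output is found at /document/chunks/*/myEmbedding.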

Vector Dimension Mismatch Error: The vector field 'embedding', with dimension of '1536', expects a length of '1536'. However, the provided vector has a length of '0'.
Cause: This suggests that either the chunk content is empty or the embedding skill is not producing a vector for it.
Solution: Ensure that the content of each chunk isn't empty, skip empty chunks, and verify your embedding skill configuration:
{
  "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
  "name": "my-embedding-skill",
  "description": "Generates embeddings from text.",
  "context": "/document/chunks/*",
  "resourceUri": "<redacted>",
  "deploymentId": "<redacted>",
  "modelName": "text-embedding-ada-002",
  "inputs": [ { "name": "text", "source": "/document/chunks/*" } ],
  "outputs": [ { "name": "embedding", "targetName": "myEmbedding" } ]
}
If the context is incorrect, the skill may not process the chunks at all. The /* ensures that it processes each chunk individually.

Missing or Empty Output Field Mapping Error: Could not map output field 'embedding' to search index. Missing or empty value '/document/chunks/2/myEmbedding'.
Cause: The skill might not be generating the expected myEmbedding field for all chunks, possibly due to empty content or incorrect processing.
Solution: Update your output field mapping to match the correct path:

"outputFieldMappings": [ { "sourceFieldName": "/document/chunks/*/myEmbedding", "targetFieldName": "embedding" } ]

Then validate that myEmbedding is being created for every chunk. If some chunks are empty, consider skipping them using conditions.
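
One caveat: a path containing /* resolves to one embedding per chunk, while the error messages above show that the 'embedding' field expects a single 1536-length vector per search document. If the goal is one search document per chunk, index projections in the skillset are one documented way to get there. The fragment below is a minimal sketch, not a tested configuration. The index name chunk-index and the field names parent_id, chunk, and title are hypothetical, and it assumes an API version that supports indexProjections and a target index whose key field uses the keyword analyzer.

```json
{
  "indexProjections": {
    "selectors": [
      {
        "targetIndexName": "chunk-index",
        "parentKeyFieldName": "parent_id",
        "sourceContext": "/document/chunks/*",
        "mappings": [
          { "name": "chunk", "source": "/document/chunks/*" },
          { "name": "embedding", "source": "/document/chunks/*/myEmbedding" },
          { "name": "title", "source": "/document/metadata_storage_name" }
        ]
      }
    ],
    "parameters": {
      "projectionMode": "skipIndexingParentDocuments"
    }
  }
}
```

This block would sit at the top level of the skillset definition, next to "skills". With it in place, the projection writes the embedding for each chunk, so the indexer's outputFieldMappings entry for 'embedding' would likely no longer be needed.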
