Azure blob indexer HTML text extraction not working as expected

Brennan Bugbee 0 Reputation points
2025-02-17T22:21:41.19+00:00

I'm using an Azure blob indexer with a skillset pipeline, and the documentation says that HTML files are processed as follows: "Strip HTML elements and extract text". When debugging the skillset execution, I can see that the enriched document's "content" field still contains HTML elements. Why is this happening?

Azure AI Search
An Azure search service with built-in artificial intelligence capabilities that enrich information to help identify and explore relevant content at scale.

1 answer

  1. Suwarna S Kale 786 Reputation points
    2025-02-19T02:35:44.2833333+00:00

    Hello Brennan Bugbee,

    Thank you for posting your question in the Microsoft Q&A forum.

    Azure AI Search (formerly Azure Cognitive Search) lets developers build search experiences over structured and unstructured data. One of its key features is indexing content from Azure Blob Storage, including HTML files. According to the official documentation, the blob indexer processes HTML files by stripping HTML elements and extracting the text. However, as you observed during debugging, the enriched document's "content" field can still contain HTML elements.

    The HTML elements you see in the "content" field during debugging can come from a misconfigured blob indexer, from the behavior of the skillset pipeline, or from intermediate debugging artifacts. To narrow it down, verify the blob indexer configuration, review the skillset pipeline, and inspect the final indexed document rather than only the intermediate enriched document. Microsoft's documentation on configuring and troubleshooting the Azure AI Search pipeline is also a useful reference.
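
The checks described above can be sketched in a few lines of Python. This is only an illustrative sketch, not an official diagnostic: the helper names `find_html_tags` and `review_indexer_config` are hypothetical, and it assumes you have already fetched the indexer definition as JSON and copied the "content" value out of the final indexed document. `parsingMode` and `dataToExtract` are blob indexer configuration settings; treating any non-default value as a possible cause is an assumption to confirm against the documentation.

```python
from html.parser import HTMLParser


class _TagCollector(HTMLParser):
    """Collects the names of HTML start tags seen while parsing a string."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def find_html_tags(text):
    """Return the HTML start tags found in text (empty list if none)."""
    collector = _TagCollector()
    collector.feed(text or "")
    collector.close()
    return collector.tags


def review_indexer_config(indexer_definition):
    """Return warnings for settings that may bypass HTML stripping.

    Assumption: HTML stripping is expected only with the default parsing
    mode and when blob content (not just metadata) is extracted.
    """
    parameters = indexer_definition.get("parameters") or {}
    config = parameters.get("configuration") or {}
    warnings = []

    parsing_mode = config.get("parsingMode", "default")
    if parsing_mode != "default":
        warnings.append(f"parsingMode is '{parsing_mode}', not 'default'")

    data_to_extract = config.get("dataToExtract", "contentAndMetadata")
    if data_to_extract != "contentAndMetadata":
        warnings.append(f"dataToExtract is '{data_to_extract}'")

    return warnings


if __name__ == "__main__":
    # Hypothetical indexer definition and indexed "content" value.
    sample_indexer = {"parameters": {"configuration": {"parsingMode": "text"}}}
    sample_content = "<p>Hello <b>world</b></p>"

    print(review_indexer_config(sample_indexer))
    print(find_html_tags(sample_content))
```

If `find_html_tags` returns an empty list for the "content" field of the final indexed document, the markup seen in the debug session is more likely an intermediate artifact than an indexing problem.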

    If this answer helped, please remember to click "Accept Answer", as this may help other community members who face a similar issue.

