Azure blob indexer html text extraction not working as expected

Question

I'm using an Azure blob indexer with a skillset pipeline, and the documentation says that HTML files are processed as follows: "Strip HTML elements and extract text". When debugging the skillset execution, I can see that the enriched document's "content" field still contains HTML elements. Why is this happening?

Answer

Hello Brennan Bugbee,

Thank you for posting your question in the Microsoft Q&A forum.

Azure Cognitive Search is a powerful service that enables developers to build rich search experiences over structured and unstructured data. One of its key features is the ability to index content from Azure Blob Storage, including HTML files. According to the official documentation, the Azure Blob Indexer processes HTML files by stripping HTML elements and extracting text. However, during debugging, you may observe that the enriched document's "content" field still contains HTML elements.

HTML elements that persist in the "content" field during debugging can stem from a misconfigured blob indexer, from skillset pipeline behavior, or from intermediate debugging artifacts. To narrow this down, verify the blob indexer configuration, review the skillset pipeline, and inspect the final indexed document to confirm whether the HTML elements were actually stripped. The Microsoft documentation listed below covers configuring and troubleshooting the Azure Cognitive Search pipeline:
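One way to make the "inspect the final indexed document" step concrete is to check the text of the "content" field for leftover markup, rather than judging it by eye in the debug session. The sketch below uses only the Python standard library; the function name find_html_tags and the sample strings are hypothetical, and how you retrieve the field value from your index is left to you.

```python
from html.parser import HTMLParser


class _TagCollector(HTMLParser):
    """Records the name of every start or end tag the parser encounters."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_endtag(self, tag):
        self.tags.append(tag)


def find_html_tags(text):
    """Return the sorted, de-duplicated tag names found in `text`.

    An empty list suggests the markup was stripped before indexing.
    """
    collector = _TagCollector()
    collector.feed(text)
    collector.close()
    return sorted(set(collector.tags))


# Hypothetical field values, standing in for the "content" field of a document.
raw_content = "<html><body><p>Quarterly report</p></body></html>"
stripped_content = "Quarterly report"

print(find_html_tags(raw_content))       # ['body', 'html', 'p']
print(find_html_tags(stripped_content))  # []
```

Running the same check on the value shown in the debug session and on the document returned by a search query should show whether the markup survives into the index or only appears at an intermediate stage.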

Azure Cognitive Search Blob Indexer Overview: Indexing Documents in Azure Blob Storage
Blob Indexer Configuration Settings: Blob Indexer Configuration Parameters
Skillset Pipeline Documentation: Add Cognitive Skills to an Azure Cognitive Search Pipeline
Debugging Enriched Documents: Debugging Skillset Execution
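For the configuration check, a small script can flag indexer settings that differ from the defaults. The sketch below assumes the indexer definition has been loaded as a Python dictionary (for example, from the JSON returned by the service) and looks at the parsingMode and dataToExtract settings under parameters.configuration. The function name check_indexer_parsing is hypothetical, and the idea that a non-default parsing mode such as "text" leaves markup in place is an assumption to verify against your own indexer, not a confirmed diagnosis of this case.

```python
def check_indexer_parsing(indexer_definition):
    """Return warnings for settings that may affect HTML text extraction.

    `indexer_definition` is the indexer JSON loaded as a dict. Missing
    keys are treated as the service defaults.
    """
    parameters = indexer_definition.get("parameters") or {}
    config = parameters.get("configuration") or {}
    warnings = []

    # Assumption: only the default parsing mode applies HTML-aware extraction.
    parsing_mode = config.get("parsingMode", "default")
    if parsing_mode != "default":
        warnings.append(
            f"parsingMode is '{parsing_mode}'; the blob may be read without "
            "HTML-aware extraction."
        )

    # The content field is only populated when content extraction is enabled.
    data_to_extract = config.get("dataToExtract", "contentAndMetadata")
    if data_to_extract != "contentAndMetadata":
        warnings.append(
            f"dataToExtract is '{data_to_extract}'; blob content is not "
            "being extracted."
        )

    return warnings


# Hypothetical indexer definition used only to illustrate the check.
example = {"parameters": {"configuration": {"parsingMode": "text"}}}
for message in check_indexer_parsing(example):
    print(message)
```

An empty result only means these two settings are at their defaults; it does not rule out skillset behavior or field mappings as the cause.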

If the above answer helped, please do not forget to "Accept Answer", as this may help other community members who face a similar issue.
