Hello Brennan Bugbee,
Thank you for posting your question in the Microsoft Q&A forum.
Azure Cognitive Search can index content from Azure Blob Storage, including HTML files. According to the official documentation, the Azure Blob Indexer processes HTML files by stripping HTML elements and extracting the text. During debugging, however, you may observe that the enriched document's "content" field still contains HTML elements.
If HTML elements persist in the "content" field during debugging, the likely causes are a misconfigured Blob Indexer, the behavior of the skillset pipeline, or intermediate artifacts that appear only in the debugging view. To narrow it down, verify the Blob Indexer configuration, review the skillset pipeline, and inspect the final indexed document to confirm whether the HTML elements were actually stripped. The following Microsoft documentation covers configuring and troubleshooting the Azure Cognitive Search pipeline:
- Azure Cognitive Search Blob Indexer Overview: Indexing Documents in Azure Blob Storage
- Blob Indexer Configuration Settings: Blob Indexer Configuration Parameters
- Skillset Pipeline Documentation: Add Cognitive Skills to an Azure Cognitive Search Pipeline
- Debugging Enriched Documents: Debugging Skillset Execution
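The checks described above can be sketched in Python as follows. This is only a minimal illustration, not an official tool: the function names find_html_tags and review_indexer_configuration are hypothetical, the tag detection is a rough regular expression rather than a full HTML parser, and it assumes that the indexer definition keeps "parsingMode" and "dataToExtract" under "parameters" > "configuration". Whether a non-default value for either setting is the actual cause in your case is also an assumption that you would need to confirm against your own indexer.

```python
import re

# Rough pattern for opening, closing, and self-closing tags such as
# <p>, </div>, <br/> or <a href="...">. It is a heuristic, not an HTML parser.
TAG_PATTERN = re.compile(r"</?[A-Za-z][A-Za-z0-9]*(?:\s[^<>]*)?/?>")


def find_html_tags(text):
    """Return the distinct tag-like substrings in text, in order of first appearance."""
    seen = []
    for match in TAG_PATTERN.finditer(text):
        tag = match.group(0)
        if tag not in seen:
            seen.append(tag)
    return seen


def review_indexer_configuration(indexer_definition):
    """Return notes about indexer settings that may affect how HTML is handled.

    indexer_definition is the indexer JSON loaded into a dict. The key layout
    (parameters -> configuration) is an assumption of this sketch.
    """
    parameters = indexer_definition.get("parameters") or {}
    configuration = parameters.get("configuration") or {}
    notes = []

    parsing_mode = configuration.get("parsingMode", "default")
    if parsing_mode != "default":
        notes.append(
            "parsingMode is '%s'; HTML stripping may not apply in this mode."
            % parsing_mode
        )

    data_to_extract = configuration.get("dataToExtract", "contentAndMetadata")
    if data_to_extract != "contentAndMetadata":
        notes.append(
            "dataToExtract is '%s'; blob content may not be extracted."
            % data_to_extract
        )

    return notes


if __name__ == "__main__":
    # Hypothetical sample values; replace them with the "content" field of a
    # document returned by a query on your index and with your indexer JSON.
    sample_content = "<p>Hello <b>world</b></p>"
    sample_indexer = {"parameters": {"configuration": {"parsingMode": "text"}}}

    print(find_html_tags(sample_content))
    for note in review_indexer_configuration(sample_indexer):
        print(note)
```

Running find_html_tags on the "content" value of the final indexed document, rather than on the value shown while debugging, helps tell apart a real indexing problem from an intermediate debugging artifact.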
If the above answer helped, please do not forget to "Accept Answer", as this may help other community members who face a similar issue find this information.