Azure AI Search: Why is OCR Reprocessing All Pages on Incremental Update?

mathias Herbaux 0 Reputation points
2024-11-05T16:21:37.8133333+00:00

Hello,

I'm experimenting with Azure AI Search for a new feature in our product. I've enabled incremental enrichment, but skills that should not need to run again are still being executed.

Our clients have PDF documents that need to be indexed. We have different kinds of content:

  • Fully scanned documents
  • Partially scanned documents: only the last page, containing a signature, is scanned
  • Documents that are not scanned at all

I've set up an indexer with the incremental enrichment cache activated. The skillset consists of:

  • document cracking
  • OCR skill
  • merge skill
  • custom skill (extract metadata)
  • split skill
  • embedding skill

Once all the documents in my blob storage are indexed, I update the metadata of a single blob. I expected that OCR would not run again, but the update results in 131 pages being processed by Cognitive Services (exactly the number of images in the PDF).
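For readers trying to picture the setup, the configuration described above can be sketched roughly as follows. This is an illustrative reconstruction, not the actual definitions from this post: every resource name and the connection string are hypothetical placeholders, the embedding skill is assumed to be the Azure OpenAI one, and each skill's inputs, outputs, and context are omitted, so the skillset is not deployable as written. Document cracking is not a skill; it is controlled by the indexer's `configuration` parameters.

```python
import json

# Indexer definition body (REST shape). All names below are hypothetical
# placeholders; the post does not give the real ones.
indexer = {
    "name": "pdf-indexer",
    "dataSourceName": "pdf-blobs",
    "targetIndexName": "pdf-index",
    "skillsetName": "pdf-skillset",
    # Incremental enrichment: the cache lives in a storage account.
    "cache": {
        "storageConnectionString": "<storage-connection-string>",
        "enableReprocessing": True,
    },
    # Document cracking settings: extract text and emit normalized images
    # so that the OCR skill has something to work on.
    "parameters": {
        "configuration": {
            "dataToExtract": "contentAndMetadata",
            "imageAction": "generateNormalizedImages",
        }
    },
}

# Skills in the order listed in the post. The embedding skill type is an
# assumption; inputs/outputs are left out to keep the sketch short.
skill_order = [
    ("ocr", "#Microsoft.Skills.Vision.OcrSkill"),
    ("merge", "#Microsoft.Skills.Text.MergeSkill"),
    ("extract-metadata", "#Microsoft.Skills.Custom.WebApiSkill"),
    ("split", "#Microsoft.Skills.Text.SplitSkill"),
    ("embedding", "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill"),
]

skillset = {
    "name": "pdf-skillset",
    "skills": [{"@odata.type": odata_type, "name": name}
               for name, odata_type in skill_order],
}

print(json.dumps(indexer, indent=2))
print(json.dumps(skillset, indent=2))
```

Laying the pipeline out this way makes the dependency chain visible: OCR sits directly downstream of document cracking, so anything that causes the document to be cracked again could plausibly cause its images to be sent through OCR again.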

I checked the cached data and found that, for this document, there are 262 images in the binary folder.
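A quick sanity check on the two numbers reported here, as a minimal sketch. The interpretation in the comments is an assumption, not something the post or the service confirms.

```python
images_in_pdf = 131    # pages billed by Cognitive Services on the update run
cached_binaries = 262  # images found in the cache's binary folder

# How many full sets of the PDF's images does the cache hold?
copies, remainder = divmod(cached_binaries, images_in_pdf)
print(copies, remainder)  # prints "2 0"

# Assumption: an exact multiple of two would be consistent with the images
# having been written to the cache on two separate runs (the initial
# indexing and the run after the metadata update) rather than reused.
```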

Something has invalidated the cache, and I wonder what.

Azure AI Search