I'm working on a project that will index a large number of blobs from Azure Storage (PDFs and images), using an OCR skill to extract text for use in Azure Search.
Using OCR to index all of this data is likely to cost several thousand dollars, and I'm trying to figure out the best way to avoid redoing all of the OCR if some Azure resource (index or indexer) goes down, gets deleted, hits an unrecoverable error, etc. I've got a few initial thoughts:
- Manually back up the search indices and restore them if necessary - I want to do a proof of concept here, but I'm unsure whether this would be sufficient if the indexer goes down or gets deleted. Presumably it would help if something goes wrong with the index.
- Use incremental enrichment with cached content - This seems like something we'd likely do anyway, as updates to skills or other things later in the indexing pipeline would be able to reuse results from the cached enrichment skill. However, I'm not sure this would adequately handle a problem with the indexer itself. I had considered manually backing up the cache and restoring it to a new indexer, but the documentation says:
  > Each indexer is assigned a unique and immutable cache identifier that corresponds to the container it is using.
  >
  > [...]
  >
  > The lifecycle of the cache is managed by the indexer. If an indexer is deleted, its cache is also deleted.
- Remove the OCR step from the indexing pipeline, and store the intermediate results manually. These intermediate results would then be used for the search index. I'm guessing this would be the safest option, but I'd lose some of the convenience of having the OCR enrichment be part of the search indexing. I'm also not sure what the best storage method would be for this data (JSON blobs in another storage container?). Nor am I sure how best to handle incrementally running OCR on new documents as they come in - I could update the document upload code to do it, but having the indexer process new documents on a schedule is another convenience I'd like to keep if possible.
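
To make the third option more concrete, here is a rough sketch of what I have in mind, not a working Azure integration. It assumes the OCR output is stored as one JSON document per source blob, keyed by the blob name plus a hash of its content, so a document is only sent to OCR when no stored result exists for that exact content. The names `ocr_cache_key`, `ocr_if_needed`, `store`, and `run_ocr` are all hypothetical: `store` is a plain dict standing in for a second storage container, and `run_ocr` stands in for whatever OCR call would actually be used.

```python
import hashlib
import json
from typing import Callable, Dict


def ocr_cache_key(blob_name: str, content: bytes) -> str:
    """Derive a stable key from the blob name and a hash of its bytes."""
    digest = hashlib.sha256(content).hexdigest()
    return f"{blob_name}.{digest}.json"


def ocr_if_needed(
    blob_name: str,
    content: bytes,
    store: Dict[str, str],            # hypothetical stand-in for a results container
    run_ocr: Callable[[bytes], str],  # hypothetical stand-in for the paid OCR call
) -> str:
    """Return OCR text for a blob, calling run_ocr only on a cache miss."""
    key = ocr_cache_key(blob_name, content)
    if key in store:
        # Result already stored for this exact content: no OCR cost incurred.
        return json.loads(store[key])["text"]
    text = run_ocr(content)
    store[key] = json.dumps({"source": blob_name, "text": text})
    return text
```

Because the stored results would live outside the indexer, deleting or recreating the index or indexer would not touch them; the search index could be rebuilt from the stored JSON alone. What this sketch does not address is the scheduling question above, i.e. what would trigger `ocr_if_needed` for newly uploaded documents.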