AKS/Docker: I am processing PDFs as large as 5,000 to 10,000 pages for OCR. I'm trying to run the pages concurrently in groups of 50 but running into a situation where memory goes to 10 GB for a 200 MB file.

Question

I am processing PDFs as large as 5,000 to 10,000 pages for OCR, with files of around 200 MB. I'm trying to run the pages concurrently in groups of 50 but running into a situation where memory goes to 10 GB for a 200 MB file. I've tried forcing garbage collection, but no matter what I do, it keeps going to 10 GB of memory, which becomes untenable to process on AKS.

I'm trying to process each document inside a single pod, but multi-threaded. The best configuration I've found seems to be 50 threads with 50 pages per thread. However, even with that, once a document is larger than 1,000 pages, my memory blows up.

Answer

Hello Adam Plager,

Welcome to the Microsoft Q&A and thank you for posting your questions here.

I understand that memory usage spikes to 10 GB for a single file, and that for larger files this makes the solution unsustainable in an AKS environment. You have also tried garbage collection and memory optimization but cannot keep memory growth under control.

These are possible causes of the high memory usage:

Some OCR pipelines load the entire document into RAM instead of processing it page by page.
If each page is rendered to an image before OCR, high DPI settings can significantly increase memory usage, because an uncompressed bitmap grows with the square of the DPI.
Running many threads (50) means many rendered pages are held in memory at the same time, so peak memory grows with the number of pages in flight, and the threads may also contend for CPU.
If no memory limits are defined for the pod in AKS, the container may keep consuming node memory until it is OOM-killed or evicted.

Therefore, instead of forcing 50 threads per file, try reducing concurrency, using page by page streaming, setting DPI/image optimizations, and enforcing memory limits in AKS. These changes should reduce memory consumption significantly while maintaining OCR efficiency. Below are the steps to do it:

Process PDFs Page by Page Instead of Loading Entire Document:

Use streaming based processing rather than loading the full PDF into memory.
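Libraries like PyMuPDF (fitz) or pdfplumber can open a PDF and extract or render one page at a time; each page image can then be passed to an OCR engine such as Tesseract and released before the next page is rendered.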
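As a minimal sketch of page-by-page processing, assuming PyMuPDF 1.19.2 or later is installed (for the `dpi` argument of `get_pixmap`): the function names `render_pages` and `ocr_stream`, and the `ocr_page` and `sink` callables, are hypothetical placeholders for whatever OCR engine and output store you actually use.

```python
from typing import Callable, Iterable, Iterator, Tuple


def render_pages(pdf_path: str, dpi: int = 150) -> Iterator[Tuple[int, bytes]]:
    """Yield (page_number, PNG bytes) lazily, one page at a time."""
    import fitz  # PyMuPDF; imported here so the module loads without it

    with fitz.open(pdf_path) as doc:
        for page in doc:
            # Grayscale rendering: one byte per pixel instead of three.
            pix = page.get_pixmap(dpi=dpi, colorspace=fitz.csGRAY)
            yield page.number, pix.tobytes("png")
            # 'pix' is rebound on the next iteration, so the previous
            # bitmap becomes collectable before a new one is rendered.


def ocr_stream(
    pages: Iterable[Tuple[int, bytes]],
    ocr_page: Callable[[bytes], str],      # hypothetical OCR callable
    sink: Callable[[int, str], None],      # hypothetical result writer
) -> int:
    """Run OCR on each page as it arrives and hand the text to 'sink'.

    Nothing is accumulated here, so memory stays close to one page.
    """
    count = 0
    for number, image in pages:
        sink(number, ocr_page(image))
        count += 1
    return count
```

Usage would look like `ocr_stream(render_pages("big.pdf"), my_ocr, my_writer)`, where `my_ocr` and `my_writer` are your own functions. Writing each page's text out through `sink` rather than collecting everything in a list is what keeps the footprint flat.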

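Reduce Concurrency to a Manageable Level:

Instead of 50 threads per file, experiment with 10 to 20 threads and measure peak memory at each setting.
Bound the number of pages in flight (for example with a queue or a semaphore) so that a batch is only rendered once earlier pages have been processed and released.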
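One way to cap how many pages are held in memory at once is a semaphore in front of a thread pool. This is a sketch under assumptions, not a tuned implementation: `ocr_bounded`, `ocr_page`, and the default values for `max_workers` and `max_in_flight` are illustrative and should be adjusted after measuring.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Tuple


def ocr_bounded(
    pages: Iterable[Tuple[int, object]],
    ocr_page: Callable[[object], str],   # hypothetical OCR callable
    max_workers: int = 8,
    max_in_flight: int = 16,
) -> Dict[int, str]:
    """OCR pages concurrently while limiting how many are in memory."""
    slots = threading.BoundedSemaphore(max_in_flight)
    results: Dict[int, str] = {}
    lock = threading.Lock()

    def work(number: int, image: object) -> None:
        try:
            text = ocr_page(image)
            with lock:
                results[number] = text
        finally:
            slots.release()  # free a slot even if OCR raised

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = []
        for number, image in pages:
            # Blocks the producer, so a lazy 'pages' iterator is not
            # advanced (and no new page rendered) until a slot is free.
            slots.acquire()
            futures.append(pool.submit(work, number, image))
        for future in futures:
            future.result()  # re-raise any worker exception
    return results
```

The important detail is that `pages` is consumed lazily: with a generator as input, at most `max_in_flight` rendered pages exist at any time, regardless of how long the document is.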

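Optimize Image Processing for OCR:

Reduce DPI settings (e.g., from 300 DPI to 150 DPI) if full precision is unnecessary; halving the DPI cuts the pixel count, and so the bitmap size, to a quarter.
Render in grayscale instead of RGB where OCR accuracy allows. Note that image compression (e.g., JPEG instead of PNG) reduces the size of files written to disk or sent over the network, but a decoded bitmap occupies the same memory regardless of the file format.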
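To put rough numbers on this, the following sketch estimates the uncompressed bitmap size of one rendered page. It assumes US Letter pages (8.5 x 11 inches) and 8 bits per channel; the function name `bitmap_bytes` is illustrative, and real pipelines add overhead on top of the raw pixels.

```python
def bitmap_bytes(width_in: float, height_in: float, dpi: int, channels: int) -> int:
    """Bytes needed for one uncompressed page bitmap at 8 bits per channel."""
    return round(width_in * dpi) * round(height_in * dpi) * channels


# US Letter at 300 DPI, RGB: 2550 x 3300 pixels x 3 bytes
letter_rgb_300 = bitmap_bytes(8.5, 11, 300, 3)    # 25,245,000 bytes (~24 MiB)

# US Letter at 150 DPI, grayscale: 1275 x 1650 pixels x 1 byte
letter_gray_150 = bitmap_bytes(8.5, 11, 150, 1)   # 2,103,750 bytes (~2 MiB)

# How many 300 DPI RGB pages fit into 10 GiB if all are held at once
pages_in_10_gib = (10 * 1024**3) // letter_rgb_300  # 425
```

Under these assumptions, about 425 pages rendered at 300 DPI in RGB and held simultaneously are enough to reach 10 GiB, while the same page at 150 DPI in grayscale is 12 times smaller.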

Enable Memory Limits and Autoscaling in AKS by defining resource requests and limits in the pod spec YAML to cap memory usage per pod:

  resources:
    requests:
      memory: "4Gi"
    limits:
      memory: "8Gi"

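Use the Horizontal Pod Autoscaler (HPA) to scale the number of pods dynamically based on memory usage. HPA adds replicas rather than shrinking a single pod's footprint, so it helps most when documents or page ranges are distributed across pods.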
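Once a memory limit is set, the process can read it from inside the container and size its own concurrency accordingly. The sketch below reads the standard Linux cgroup files; the function names, the 50% `headroom` factor, and the per-page byte estimate passed in are assumptions to adjust for your workload.

```python
from pathlib import Path
from typing import Iterable, Optional

CGROUP_LIMIT_FILES = (
    "/sys/fs/cgroup/memory.max",                    # cgroup v2
    "/sys/fs/cgroup/memory/memory.limit_in_bytes",  # cgroup v1
)


def container_memory_limit(paths: Iterable[str] = CGROUP_LIMIT_FILES) -> Optional[int]:
    """Return the container memory limit in bytes, or None if unlimited."""
    for path in paths:
        try:
            raw = Path(path).read_text().strip()
        except OSError:
            continue  # file not present on this cgroup version
        # cgroup v2 reports "max" when unlimited; cgroup v1 reports a
        # very large number, which is treated as unlimited here.
        if raw.isdigit() and int(raw) < 2**60:
            return int(raw)
    return None


def pages_in_flight(limit_bytes: Optional[int], bytes_per_page: int,
                    headroom: float = 0.5, floor: int = 1) -> int:
    """Pages that fit in a fraction ('headroom') of the memory limit."""
    if limit_bytes is None:
        return floor  # no known limit: stay conservative
    return max(floor, int(limit_bytes * headroom) // bytes_per_page)
```

For example, with the 8Gi limit above and roughly 25 MB per rendered page, `pages_in_flight(8 * 1024**3, 25_245_000)` gives 170 pages, leaving half of the limit for the PDF library, the OCR engine, and the runtime itself.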

Debug and Monitor Memory Usage:

Use Azure Monitor or Prometheus/Grafana to track memory utilization.
Profile the process using memory_profiler in Python or .NET performance tools to identify memory leaks.
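As a lightweight complement to a full profiler, the process can log its own peak resident memory while it works. This sketch uses the standard-library `resource` module instead of memory_profiler, and assumes a Linux container, where `ru_maxrss` is reported in KiB; `peak_rss_mib` and `watch` are illustrative names.

```python
import resource
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def peak_rss_mib() -> float:
    """Peak resident set size of this process in MiB (Linux: KiB units)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024


def watch(items: Iterable[T], every: int = 100,
          report: Callable[[str], None] = print) -> Iterator[T]:
    """Pass items through unchanged, reporting peak RSS every N items."""
    for i, item in enumerate(items, 1):
        yield item
        # Runs when the consumer asks for the next item, i.e. after
        # item number i has been processed.
        if i % every == 0:
            report(f"{i} pages, peak RSS {peak_rss_mib():.0f} MiB")
```

Wrapping the page iterator, as in `for page in watch(pages, every=100): ...`, shows whether memory plateaus after the first batch or keeps climbing with the page count. Because this is a high-water mark it never decreases, so a steadily rising value points to pages being retained rather than released.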

I hope this is helpful! Do not hesitate to let me know if you have any other questions or clarifications.

Please don't forget to close the thread by upvoting and accepting this as the answer if it was helpful.
