Hello Adam Plager,
Welcome to the Microsoft Q&A and thank you for posting your questions here.
I understand that your OCR solution's memory usage spikes to around 10 GB for a single file, and that processing larger files makes the solution unsustainable in an AKS environment. You have already tried garbage collection and memory optimization but cannot keep memory growth under control.
These are possible causes of the high memory usage:
- Many OCR libraries load the entire document into RAM instead of processing it page by page.
- If each page is converted to an image before OCR, high DPI settings could significantly increase memory usage.
- Running too many threads (50) may create contention, leading to excessive memory use.
- If AKS has no memory limits defined, the container may consume available memory until it crashes.
Therefore, instead of forcing 50 threads per file, try reducing concurrency, using page-by-page streaming, applying DPI/image optimizations, and enforcing memory limits in AKS. These changes should reduce memory consumption significantly while maintaining OCR efficiency. Below are the steps, each with a short illustrative sketch:
- Process PDFs Page by Page Instead of Loading the Entire Document:
- Use streaming-based processing rather than loading the full PDF into memory.
- Libraries like PyMuPDF (fitz), pdfplumber, or Tesseract OCR can extract and process pages individually, as in the sketch below.
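For example, here is a minimal sketch of page-by-page OCR with PyMuPDF and pytesseract; the helper name, file path, and DPI value are placeholders to adapt to your pipeline, and the `dpi` keyword assumes a recent PyMuPDF release:

```python
# Minimal sketch: render and OCR one page at a time so only a single page
# is ever held in memory. Assumes PyMuPDF (fitz), Pillow, and pytesseract.
import io

import fitz  # PyMuPDF
import pytesseract
from PIL import Image

def ocr_pdf_page_by_page(path: str, dpi: int = 150) -> list[str]:
    texts = []
    with fitz.open(path) as doc:
        for page in doc:
            # Render only the current page to an image at a modest DPI
            # (dpi= requires a recent PyMuPDF; older versions use a Matrix)
            pix = page.get_pixmap(dpi=dpi)
            img = Image.open(io.BytesIO(pix.tobytes("png")))
            texts.append(pytesseract.image_to_string(img))
            # pix/img go out of scope each iteration, so memory stays bounded
    return texts
```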
- Reduce Concurrency to a Manageable Level:
- Instead of 50 threads per file, experiment with 10-20 threads.
- Use asynchronous or pooled processing to handle batches without overloading memory.
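As a sketch, a small fixed-size worker pool caps how many files are processed at once; `max_workers=10` is only a starting point to tune, and `ocr_pdf_page_by_page` is the helper from the sketch above:

```python
# Minimal sketch: a bounded thread pool instead of 50 threads per file.
from concurrent.futures import ThreadPoolExecutor, as_completed

def ocr_many(paths: list[str], max_workers: int = 10) -> dict[str, list[str]]:
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # One task per file; the pool never runs more than max_workers at once
        futures = {pool.submit(ocr_pdf_page_by_page, p): p for p in paths}
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()
    return results
```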
- Optimize Image Processing for OCR:
- Reduce DPI settings (e.g., from 300 DPI to 150 DPI) if full precision is unnecessary.
- Use image compression (e.g., JPEG instead of PNG) to reduce memory overhead.
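For illustration, a page can be rendered at a reduced DPI and re-encoded as JPEG before it is handed to the OCR engine; the 150 DPI and quality=80 values below are assumptions to tune, not recommendations:

```python
# Minimal sketch: render at reduced DPI and re-encode as JPEG before OCR.
import io

import fitz  # PyMuPDF
import pytesseract
from PIL import Image

def ocr_page_optimized(page: "fitz.Page", dpi: int = 150, quality: int = 80) -> str:
    pix = page.get_pixmap(dpi=dpi)                     # lower DPI -> smaller raster
    img = Image.open(io.BytesIO(pix.tobytes("png")))
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=quality)  # JPEG instead of PNG
    buf.seek(0)
    return pytesseract.image_to_string(Image.open(buf))
```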
- Enable Memory Limits and Autoscaling in AKS:
- Define resource limits in your deployment YAML to cap memory usage per pod:
```yaml
resources:
  requests:
    memory: "4Gi"
  limits:
    memory: "8Gi"
```
- Use a Horizontal Pod Autoscaler (HPA) to scale pods dynamically based on memory usage.
- Debug and Monitor Memory Usage:
- Use Azure Monitor or Prometheus/Grafana to track memory utilization.
- Profile the process using `memory_profiler` in Python, or .NET performance tools, to identify memory leaks.
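For example, with the Python `memory_profiler` package you can decorate the suspect function and get a line-by-line memory report; the function name and file path below are placeholders:

```python
# Minimal sketch: line-by-line memory profiling with memory_profiler
# (pip install memory-profiler). Running the script prints memory use per line.
from memory_profiler import profile

@profile
def process_file(path: str) -> None:
    # Placeholder body: profile your real OCR entry point instead
    data = open(path, "rb").read()
    del data

if __name__ == "__main__":
    process_file("sample.pdf")  # placeholder file name
```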
I hope this is helpful! Do not hesitate to let me know if you have any other questions or need further clarification.
Please don't forget to close the thread by upvoting and accepting this as the answer if it is helpful.