DocumentIntelligence: UnicodeDecodeError While Batch OCR Local PDFs

Jordan Traylor 0 Reputation points
2024-12-19T17:32:05.6433333+00:00

Hello,

I have been trying to use Azure AI's DocumentIntelligence to OCR about 1,000 locally-stored PDF files. I mostly have been following guidance by @dupammi on this question: https://learn.microsoft.com/en-us/answers/questions/1661108/how-to-read-data-from-a-local-pdf-using-document-i

My problem is that, whether I use the base64 encoding that dupammi's example uses or non-base64, the code runs into a "problem" PDF and returns a UnicodeDecodeError. It is able to process about 80 PDFs before running into the problematic PDF. I assume it is a problem with the PDF itself, but I was able to analyze it in the DocumentIntelligence Studio, which worked perfectly. I merely want to replicate that success for all of my local PDFs. Does anyone know how I can fix the UnicodeDecodeError that prevents me from doing so?

Happy to provide more information if needed.

Azure AI Document Intelligence
An Azure service that turns documents into usable data. Previously known as Azure Form Recognizer.

1 answer

  1. Sina Salam 14,551 Reputation points
    2024-12-20T14:47:17.4766667+00:00

    Hello Jordan Traylor,

    Welcome to the Microsoft Q&A and thank you for posting your questions here.

    I understand that you are getting a UnicodeDecodeError, whether you use base64 or non-base64 encoding, when using Document Intelligence to batch-OCR local PDFs.

    There are several strategies to address the issue, but since the PDF works in the Document Intelligence Studio, I will focus on the most likely root cause: how the Azure Document Intelligence SDK handles the file's encoding during programmatic submission. The steps below cover investigation and resolution, and provide contingency plans for files that cannot be fixed.

    Firstly, a UnicodeDecodeError typically results from handling file content as text instead of binary. To address this:

    1. Always open files in binary mode ("rb") when interacting with the Azure SDK. This prevents the file content from being misinterpreted as text, which is the usual cause of a UnicodeDecodeError.
            with open("example.pdf", "rb") as file:
                result = client.begin_analyze_document("prebuilt-document", document=file).result()
      
    2. Confirm that you’re using the latest stable version of the Azure Document Intelligence SDK, as updates often fix encoding and compatibility issues. Refer to the official Azure SDK documentation here - https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence for updates.
    3. Use raw binary input for PDFs unless explicitly required by your workflow. Base64 encoding can introduce unnecessary complexity.
    4. If the error arises with larger or complex PDFs, process smaller subsets of files to understand the SDK’s behavior and limitations.
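    To see why point 1 matters, here is a minimal, Azure-independent sketch (the file name `sample.pdf` is just a placeholder). PDF files contain raw binary streams that are not valid UTF-8, so reading one in text mode raises exactly this UnicodeDecodeError, while binary mode returns the raw bytes the SDK expects:

```python
# PDFs mix ASCII markers with raw binary stream data that is not valid UTF-8.
pdf_like = b"%PDF-1.7\n\xff\xfe\x00\x80 binary stream data"

with open("sample.pdf", "wb") as f:
    f.write(pdf_like)

# Text mode ("r") tries to decode the bytes as UTF-8 and fails:
try:
    with open("sample.pdf", "r", encoding="utf-8") as f:
        f.read()
except UnicodeDecodeError as e:
    print("text mode failed:", e.reason)

# Binary mode ("rb") returns raw bytes, which is what the SDK expects:
with open("sample.pdf", "rb") as f:
    data = f.read()
print("binary mode read", len(data), "bytes")
```

    If a UnicodeDecodeError appears anywhere in your pipeline, check for a text-mode `open()` (or a `.decode()` call on the raw bytes) between the file and the SDK call.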

    Secondly, if a PDF works in the Document Intelligence Studio but fails in the SDK, compare the Studio's output to the SDK's output for the same file. This can highlight discrepancies, such as metadata issues or unsupported formats in API requests. You can download and review the Studio’s processed results to identify potential differences.

    Thirdly, introduce robust error handling to skip problematic files, continue processing, and log errors for further debugging:

    import os
    from azure.ai.formrecognizer import DocumentAnalysisClient
    from azure.core.credentials import AzureKeyCredential

    # Azure credentials
    endpoint = "your_endpoint"
    key = "your_key"
    client = DocumentAnalysisClient(endpoint, AzureKeyCredential(key))

    # Directories and logging
    pdf_folder = "path_to_your_pdfs"
    error_log = "error_log.txt"

    # Process files, skipping and logging any that fail
    for pdf in os.listdir(pdf_folder):
        if pdf.lower().endswith(".pdf"):
            file_path = os.path.join(pdf_folder, pdf)
            try:
                with open(file_path, "rb") as f:  # binary mode is essential
                    poller = client.begin_analyze_document("prebuilt-document", document=f)
                    result = poller.result()
                    # Process result here
            except Exception as e:
                with open(error_log, "a") as log:
                    log.write(f"File: {file_path}, Error: {e}\n")

    This approach ensures uninterrupted processing and provides a log for debugging problematic files later.

    Now, you can use advanced debugging and analytics, such as:

    1. Utilize Azure Monitor or built-in SDK diagnostics to log API responses and trace processing steps for failed files.
    2. Tools like pdfminer or PyMuPDF can help identify anomalies in problematic PDFs, such as embedded fonts or invalid metadata, which may trigger errors. For an example with pdfminer:
            from pdfminer.high_level import extract_text
            text = extract_text("problematic.pdf")
            print(text)
      
    3. For pre-processing, you can convert problematic PDFs to a standard format using tools like pdftotext or Adobe Acrobat to eliminate encoding issues.

    Finally, if the error persists despite these measures, contact Azure Support.

    For more examples and reading, check the reference on Azure Form Recognizer Error Resolution - https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/how-to-guides/resolve-errors

    I hope this is helpful! Do not hesitate to let me know if you have any other questions.


    Please don't forget to close up the thread here by upvoting and accept it as an answer if it is helpful.

    0 comments No comments
