DocumentIntelligence: UnicodeDecodeError While Batch OCR Local PDFs

Jordan Traylor 0 Reputation points
2024-12-19T17:32:05.6433333+00:00

Hello,

I have been trying to use Azure AI's DocumentIntelligence to OCR about 1,000 locally-stored PDF files. I mostly have been following guidance by @dupammi on this question: https://learn.microsoft.com/en-us/answers/questions/1661108/how-to-read-data-from-a-local-pdf-using-document-i

My problem is that, whether I use the base64 encoding that dupammi's example uses or non-base64, the code runs into a "problem" PDF and returns a UnicodeDecodeError. It is able to process about 80 PDFs before running into the problematic PDF. I assume it is a problem with the PDF itself, but I was able to analyze it in the DocumentIntelligence Studio, which worked perfectly. I merely want to replicate that success for all of my local PDFs. Does anyone know how I can fix the UnicodeDecodeError that prevents me from doing so?

Happy to provide more information if needed.

Azure AI Document Intelligence
An Azure service that turns documents into usable data. Previously known as Azure Form Recognizer.

1 answer

  1. Sina Salam 15,011 Reputation points
    2024-12-20T14:47:17.4766667+00:00

    Hello Jordan Traylor,

    Welcome to the Microsoft Q&A and thank you for posting your questions here.

    I understand that you are getting a UnicodeDecodeError, with or without base64 encoding, while using Document Intelligence to batch OCR local PDFs.

    There are several strategies to address the issue. Because the PDF works in the Document Intelligence Studio, I will focus on the likely root cause of the UnicodeDecodeError: how the file is encoded and handled when it is submitted programmatically through the Azure Document Intelligence SDK. The approach below covers investigation and resolution, and provides contingency plans for files that cannot be fixed.

    Firstly, the UnicodeDecodeError typically results from handling file content incorrectly as text instead of binary. To address this:

    1. Always open files in binary mode ("rb") when passing them to the Azure SDK, so that the file content is never decoded as text.
            with open("example.pdf", "rb") as file:
                result = client.begin_analyze_document("prebuilt-document", document=file).result()
      
    2. Confirm that you’re using the latest stable version of the Azure Document Intelligence SDK, as updates often fix encoding and compatibility issues. Refer to the official Azure SDK documentation here - https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence for updates.
    3. Use raw binary input for PDFs unless explicitly required by your workflow. Base64 encoding can introduce unnecessary complexity.
    4. If the error arises with larger or complex PDFs, process smaller subsets of files to understand the SDK’s behavior and limitations.
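
    To illustrate the first point in the list above, here is a minimal, self-contained sketch of why the open mode matters. It does not call Azure at all. The function name `read_modes_demo` and the sample bytes are assumptions made up for this illustration; the point is only that bytes which are not valid UTF-8 fail in text mode and pass through untouched in binary mode.

```python
import os
import tempfile


def read_modes_demo():
    """Write bytes that are not valid UTF-8, then read them back both ways."""
    # Hypothetical sample: a PDF-style header followed by raw binary bytes.
    payload = b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n"
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        tmp.write(payload)
        path = tmp.name
    try:
        # Text mode: Python tries to decode the bytes as UTF-8 and fails.
        text_mode_error = None
        try:
            with open(path, "r", encoding="utf-8") as f:
                f.read()
        except UnicodeDecodeError as exc:
            text_mode_error = exc
        # Binary mode: the bytes come back exactly as written.
        with open(path, "rb") as f:
            data = f.read()
    finally:
        os.remove(path)
    return text_mode_error, data


if __name__ == "__main__":
    error, data = read_modes_demo()
    print("text mode raised:", type(error).__name__ if error else None)
    print("binary mode returned", len(data), "bytes")
```

    If your batch script reads a file with "r" anywhere, or calls .decode() on the file bytes before sending them, that is a likely place for the error to come from.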

    Secondly, if a PDF works in the Document Intelligence Studio but fails in the SDK, compare the Studio's output to the SDK's output for the same file. This can highlight discrepancies, such as metadata issues or unsupported formats in API requests. You can download and review the Studio’s processed results to identify potential differences.

    Thirdly, introduce robust error handling to skip problematic files, continue processing, and log errors for further debugging:

    import os

    from azure.ai.formrecognizer import DocumentAnalysisClient
    from azure.core.credentials import AzureKeyCredential

    # Azure credentials (placeholders)
    endpoint = "your_endpoint"
    key = "your_key"
    client = DocumentAnalysisClient(endpoint, AzureKeyCredential(key))

    # Directories and logging
    pdf_folder = "path_to_your_pdfs"
    error_log = "error_log.txt"

    # Process every PDF in the folder; log failures and keep going
    for pdf in os.listdir(pdf_folder):
        if pdf.lower().endswith(".pdf"):
            file_path = os.path.join(pdf_folder, pdf)
            try:
                # Binary mode: the SDK receives the raw bytes of the PDF
                with open(file_path, "rb") as f:
                    poller = client.begin_analyze_document("prebuilt-document", document=f)
                    result = poller.result()
                    # Process result
            except Exception as e:
                # Record the file and the exception type, then move on
                with open(error_log, "a", encoding="utf-8") as log:
                    log.write(f"File: {file_path}, Error: {type(e).__name__}: {e}\n")
    

    This approach ensures uninterrupted processing and provides a log for debugging problematic files later.
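
    If the one-line log message is not enough to tell whether the UnicodeDecodeError is raised in your own code or inside the SDK, logging the full traceback can help. The following is a rough sketch under that assumption, not part of the Azure SDK: `process_all` and its `analyze` parameter are hypothetical names, and `analyze` stands in for whatever function opens the file and calls the client.

```python
import traceback


def process_all(paths, analyze, log_path):
    """Call analyze(path) for each path and log full tracebacks for failures.

    `analyze` is a placeholder for your own function that submits one file.
    Returns the list of paths that failed.
    """
    failed = []
    for path in paths:
        try:
            analyze(path)
        except Exception:
            failed.append(path)
            # format_exc() includes the file and line where the error was raised
            with open(log_path, "a", encoding="utf-8") as log:
                log.write(f"File: {path}\n{traceback.format_exc()}\n")
    return failed
```

    The traceback shows the exact line that raised the exception, which tells you whether the decoding happens before the request is sent or while the response is being handled.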

    Next, you can use more advanced debugging and analysis, such as:

    1. Utilize Azure Monitor or built-in SDK diagnostics to log API responses and trace processing steps for failed files.
    2. Tools like pdfminer (the pdfminer.six package) or PyMuPDF can help identify anomalies in problematic PDFs, such as embedded fonts or invalid metadata, which may trigger errors. For example, with pdfminer:
            from pdfminer.high_level import extract_text
            text = extract_text("problematic.pdf")
            print(text)
      
    3. For pre-processing PDFs, you can convert problematic PDFs to standard formats using tools like pdftotext or Adobe Acrobat to eliminate encoding issues.
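
    Before reaching for external tools, a quick header check can flag files that are not well-formed PDFs. This is only a sketch: `find_pdf_header` is a hypothetical helper, and the 1024-byte search window is an assumption based on how lenient many PDF readers are, not something the Azure service documents.

```python
def find_pdf_header(data: bytes, window: int = 1024) -> int:
    """Return the offset of the '%PDF-' marker within the first `window` bytes.

    Returns -1 if the marker is not found. An offset greater than 0 means
    there are extra bytes before the header, which some tools tolerate
    and others do not.
    """
    return data[:window].find(b"%PDF-")


def check_file(path: str) -> int:
    """Read only the start of a file in binary mode and locate the PDF header."""
    with open(path, "rb") as f:
        return find_pdf_header(f.read(1024))
```

    A result of 0 is what you would expect from a clean file. A result of -1, or a positive offset, marks the file as a candidate for re-saving or conversion before you submit it.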

    Finally, if the error persists despite these measures, contact Azure Support.

    For more examples and further reading, check the reference on Azure Form Recognizer Error Resolution - https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/how-to-guides/resolve-errors

    I hope this is helpful! Do not hesitate to let me know if you have any other questions.


    Please don't forget to close the thread by upvoting this answer and accepting it if it is helpful.

