Hello Jordan Traylor,
Welcome to the Microsoft Q&A and thank you for posting your questions here.
I understand that you are getting a UnicodeDecodeError with Azure Document Intelligence when batch-OCRing local PDFs, whether you submit the files base64-encoded or not.
There are several strategies to address this, but since the PDF works in the Document Intelligence Studio, I will focus on the root cause of the UnicodeDecodeError: how the file is encoded and handled by the Azure Document Intelligence SDK during programmatic submission. The steps below walk through investigation and resolution, and also provide contingency plans for files that cannot be fixed.
Firstly, the UnicodeDecodeError typically results from handling file content as text instead of binary. To address this:
- Always open files in binary mode (`"rb"`) when passing them to the Azure SDK, so the file content is never misinterpreted as text:

  ```python
  with open("example.pdf", "rb") as file:
      result = client.begin_analyze_document("prebuilt-document", document=file).result()
  ```
- Confirm that you’re using the latest stable version of the Azure Document Intelligence SDK, as updates often fix encoding and compatibility issues. Refer to the official Azure SDK documentation here - https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence for updates.
- Use raw binary input for PDFs unless explicitly required by your workflow. Base64 encoding can introduce unnecessary complexity.
- If the error arises with larger or complex PDFs, process smaller subsets of files to understand the SDK’s behavior and limitations.
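To see why treating the file as text triggers the error, here is a small self-contained sketch. The byte string below mimics the non-UTF-8 binary marker line most PDF writers emit after the header; it is illustrative only, not part of any real file:

```python
import base64

# PDF files mix an ASCII header with raw binary data; the second line
# below mimics the non-UTF-8 binary marker most PDF writers emit.
pdf_bytes = b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n"

# Treating those bytes as text is exactly what raises UnicodeDecodeError:
try:
    pdf_bytes.decode("utf-8")
except UnicodeDecodeError as err:
    print(f"decoding as text fails: {err}")

# A base64 round-trip is lossless, but it only adds an extra
# encode/decode step -- sending the raw bytes directly is simpler.
encoded = base64.b64encode(pdf_bytes)
assert base64.b64decode(encoded) == pdf_bytes
```

This also shows why base64 buys you nothing here: the round-trip just reproduces the same bytes you could have sent directly.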
Secondly, if a PDF works in the Document Intelligence Studio but fails in the SDK, compare the Studio's output to the SDK's output for the same file. This can highlight discrepancies, such as metadata issues or unsupported formats in API requests. You can download and review the Studio’s processed results to identify potential differences.
Thirdly, introduce robust error handling to skip problematic files, continue processing, and log errors for further debugging:
```python
import os

from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

# Azure credentials
endpoint = "your_endpoint"
key = "your_key"
client = DocumentAnalysisClient(endpoint, AzureKeyCredential(key))

# Directories and logging
pdf_folder = "path_to_your_pdfs"
error_log = "error_log.txt"

# Process files
for pdf in os.listdir(pdf_folder):
    if pdf.endswith(".pdf"):
        file_path = os.path.join(pdf_folder, pdf)
        try:
            with open(file_path, "rb") as f:
                poller = client.begin_analyze_document("prebuilt-document", document=f)
                result = poller.result()
                # Process result
        except Exception as e:
            with open(error_log, "a") as log:
                log.write(f"File: {file_path}, Error: {str(e)}\n")
```
This approach ensures uninterrupted processing and provides a log for debugging problematic files later.
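Once a run finishes, you can summarize the log so recurring failures stand out. A minimal sketch, assuming the `File: <path>, Error: <message>` line format written by the loop above (`summarize_log` is an illustrative helper name, not part of the SDK):

```python
def summarize_log(log_path: str) -> list[str]:
    """Return the paths of all files that failed, parsed from lines
    shaped like 'File: <path>, Error: <message>'."""
    failed = []
    with open(log_path, encoding="utf-8") as log:
        for line in log:
            if line.startswith("File: "):
                # Split off the error message to recover just the path.
                path = line[len("File: "):].split(", Error:", 1)[0]
                failed.append(path)
    return failed
```

Re-running only the files returned here is usually much faster than repeating the whole batch.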
Next, you can apply more advanced debugging and analysis:
- Utilize Azure Monitor or built-in SDK diagnostics to log API responses and trace processing steps for failed files.
- Tools like `pdfminer` or `PyMuPDF` can help identify anomalies in problematic PDFs, such as embedded fonts or invalid metadata, which may trigger errors. For example, with pdfminer:

  ```python
  from pdfminer.high_level import extract_text

  text = extract_text("problematic.pdf")
  print(text)
  ```
- For pre-processing, convert problematic PDFs to a standard format using tools like `pdftotext` or Adobe Acrobat to eliminate encoding issues.
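Before sending files to any of these tools, a cheap pre-flight check can filter out files that are not actually PDFs (a common cause of downstream errors when a download was truncated or an HTML error page was saved with a .pdf extension). This is only a heuristic based on the PDF magic bytes, not a full validation:

```python
def looks_like_pdf(data: bytes) -> bool:
    """Heuristic sanity check: a well-formed PDF starts with the
    %PDF- magic bytes and carries an %%EOF marker near the end."""
    return data.startswith(b"%PDF-") and b"%%EOF" in data[-1024:]

# Example: a truncated or mislabeled file is caught before upload.
print(looks_like_pdf(b"%PDF-1.7\n...content...\n%%EOF\n"))  # True
print(looks_like_pdf(b"<html>Not Found</html>"))            # False
```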
Finally, if the error persists despite these measures, contact Azure Support.
For more examples and reading, check the Azure Form Recognizer error resolution guide - https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/how-to-guides/resolve-errors
I hope this is helpful! Do not hesitate to let me know if you have any other questions.
Please don't forget to close the thread by upvoting and accepting this as an answer if it was helpful.