EOF Occurred in Violation of Protocol / Timeout Error in Azure Document Intelligence for Specific PDFs

Hungry Scripter 0

Encountering issues when using Azure Document Intelligence to parse certain PDF files. The errors received are:

EOF occurred in violation of protocol (_ssl.c:2417)
Timeout error

These errors only occur with specific PDF files, while others process successfully.

Has anyone experienced similar issues?

Could this be due to PDF encoding, encryption, or specific formatting?

Any troubleshooting tips would be greatly appreciated!

Saideep Anchuri 2,370 Reputation points Microsoft Vendor

2025-02-18T09:14:11.5066667+00:00

Hi Hungry Scripter

Just checking in to see whether the answer below from @Sina Salam helped.

Thank You.

1 answer

Sina Salam 18,046

Hello Hungry Scripter,

Welcome to the Microsoft Q&A and thank you for posting your questions here.

I understand that you are getting "EOF occurred in violation of protocol" and timeout errors from Azure Document Intelligence for specific PDFs.

The error EOF occurred in violation of protocol (_ssl.c:2417) suggests an issue at the SSL/TLS level rather than a PDF-specific problem. The timeout error could be due to large or complex PDFs, but could also stem from API request issues, network instability, or incorrect headers.

Therefore, to resolve the EOF error and timeout issue effectively, follow these steps:

Step 1: The error EOF occurred in violation of protocol (_ssl.c:2417) is usually related to:

SSL/TLS misconfigurations
Expired/mismatched certificates
Network proxy/firewall interference

Try the following:

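Test the connection to Azure Document Intelligence using the bash command below. It sends a plain GET, so an HTTP error status is expected; what matters is whether the TLS handshake completes: curl -v "https://<your-endpoint>/documentintelligence/documentModels/prebuilt-layout:analyze?api-version=2024-02-29-preview"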

If you see SSL_ERROR_SYSCALL or a similar error, check SSL/TLS settings.

Make sure your client can negotiate TLS 1.2 or later, which Azure services require. An outdated OpenSSL build is a common cause of handshake failures, so check which version your Python is linked against:

     import ssl
     print(ssl.OPENSSL_VERSION)  # Ensure OpenSSL is updated

If you are behind a corporate proxy or firewall, make sure it allows traffic to https://<your-azure-endpoint> without interrupting the TLS session.
If the API key is old or corrupted, generate a new one and retry.
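As a rough sketch of the TLS check described in this step, the following uses only the Python standard library to open a connection that refuses anything older than TLS 1.2 and report the negotiated version. The function names and the host name in the usage comment are illustrative assumptions, not part of any Azure SDK.

```python
import socket
import ssl


def make_tls12_context() -> ssl.SSLContext:
    """Default client context that refuses anything older than TLS 1.2."""
    ctx = ssl.create_default_context()  # verifies certificates and host names
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx


def probe_tls(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Open a TLS connection to host and return the negotiated protocol version."""
    ctx = make_tls12_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.version()  # for example "TLSv1.2" or "TLSv1.3"


# Example (hypothetical host name, requires network access):
# print(probe_tls("<your-resource>.cognitiveservices.azure.com"))
```

If probe_tls raises an ssl.SSLError from the same machine that runs your application, the problem is probably in the network path or the local OpenSSL build rather than in the PDF.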

Step 2: The timeout error can occur due to:

Large/complex PDFs > Use the async API.
Slow API response > Increase the client timeout setting.
Network instability > Retry with exponential backoff.
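For the "retry with exponential backoff" item, a minimal sketch could look like the following. The helper name, the default delays, and the set of exceptions treated as retryable are assumptions; adjust retry_on to whatever your HTTP client or SDK actually raises.

```python
import random
import time


def retry_with_backoff(func, max_attempts=5, base_delay=1.0, max_delay=30.0,
                       retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call func() until it succeeds or max_attempts is reached.

    The wait doubles after each failure (base_delay, 2x, 4x, ...), is capped
    at max_delay, and gets up to 10% random jitter added on top.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 10))
```

Usage would be along the lines of retry_with_backoff(lambda: analyze(pdf_bytes)), where analyze is your own function that submits the document.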

You can resolve it as follows:

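Switch to the asynchronous API for large PDFs by submitting the document and polling for the result instead of waiting synchronously:

   import time
   import requests
   url = "https://<your-endpoint>/documentintelligence/documentModels/prebuilt-layout:analyze?api-version=2024-02-29-preview"
   headers = {
       "Ocp-Apim-Subscription-Key": "<your-key>",
       "Content-Type": "application/pdf"
   }
   with open("document.pdf", "rb") as f:
       response = requests.post(url, headers=headers, data=f, timeout=120)
   response.raise_for_status()  # 202 Accepted means the job was queued
   result_url = response.headers["Operation-Location"]  # URL to poll for the result
   while True:
       job = requests.get(result_url, headers={"Ocp-Apim-Subscription-Key": "<your-key>"}, timeout=30).json()
       if job["status"] in ("succeeded", "failed"):
           break
       time.sleep(2)  # wait between polls
   print(job["status"])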

If you call the REST API with the Python requests library, set the client timeout explicitly, because requests applies no timeout by default. For example, to allow 120 seconds:

response = requests.post(url, headers=headers, data=f, timeout=120)

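If a document is larger than your pricing tier allows (the free F0 tier accepts files up to 4 MB; the paid S0 tier up to 500 MB), or is simply very long, split it into smaller files before uploading: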

pdftk input.pdf burst output page_%02d.pdf

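Reduce the size of scanned PDFs by re-encoding them with Ghostscript. Note that /screen downsamples images to 72 dpi, which can hurt OCR accuracy; /ebook (150 dpi) is a gentler setting. Using bash command: gs -o reduced.pdf -sDEVICE=pdfwrite -dPDFSETTINGS=/screen input.pdf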
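If you would rather split the file from Python than with pdftk, a sketch along these lines may work. It assumes the PyPDF2 package used elsewhere in this thread; the function names, the chunk size of 50 pages, and the output file naming are arbitrary choices for illustration.

```python
def chunk_page_ranges(total_pages, pages_per_chunk):
    """Return zero-based, half-open (start, end) page ranges covering the document."""
    if pages_per_chunk < 1:
        raise ValueError("pages_per_chunk must be at least 1")
    return [(start, min(start + pages_per_chunk, total_pages))
            for start in range(0, total_pages, pages_per_chunk)]


def split_pdf(input_path, pages_per_chunk=50, prefix="part"):
    """Write input_path out as several smaller PDFs and return their file names."""
    from PyPDF2 import PdfReader, PdfWriter  # imported lazily; third-party package

    reader = PdfReader(input_path)
    outputs = []
    ranges = chunk_page_ranges(len(reader.pages), pages_per_chunk)
    for index, (start, end) in enumerate(ranges, 1):
        writer = PdfWriter()
        for page_number in range(start, end):
            writer.add_page(reader.pages[page_number])
        name = f"{prefix}_{index:03d}.pdf"
        with open(name, "wb") as out:
            writer.write(out)
        outputs.append(name)
    return outputs
```

Each part can then be analyzed separately and the results concatenated in page order.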

Step 3: Validate PDF compatibility. To confirm that the PDF is readable before sending it to Azure, test whether it can be opened with Python:

 import PyPDF2
 with open("document.pdf", "rb") as f:
     reader = PyPDF2.PdfReader(f)
     print(len(reader.pages))  # Check if readable

If the PDF fails, reprocess it:

  qpdf --decrypt input.pdf output.pdf
  gs -o fixed.pdf -sDEVICE=pdfwrite input.pdf

Step 4: If errors persist, check for service issues. Review recent activity for your resource group with the bash command: az monitor activity-log list --resource-group <your-rg>

or visit the Azure Status page - https://status.azure.com

Also, if you are processing many PDFs, monitor API rate limits (429 Too Many Requests errors). If the problem continues, raise a support ticket in the Azure Portal and include detailed information.
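For the 429 case, one possible way to decide how long to wait is to honor the Retry-After response header when it is present and fall back to capped exponential backoff otherwise. This is a sketch under assumptions: the function name and default delays are made up, and it only understands Retry-After values given as a number of seconds.

```python
def delay_for_429(headers, attempt, base_delay=2.0, max_delay=60.0):
    """Seconds to wait before retrying a 429 response.

    headers is a mapping of response headers; attempt counts from 0.
    A plain dict is case-sensitive, so pass the header as "Retry-After"
    (the headers object returned by requests is case-insensitive).
    """
    value = headers.get("Retry-After", "")
    try:
        seconds = float(value)
    except (TypeError, ValueError):
        # Header missing or not a number of seconds: back off exponentially
        seconds = base_delay * 2 ** attempt
    return max(0.0, min(max_delay, seconds))
```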

I hope this is helpful! Do not hesitate to let me know if you have any other questions or clarifications.

Please don't forget to close the thread by upvoting and accepting this as the answer if it was helpful.

Hungry Scripter 0 Reputation points

2025-02-18T21:47:12.87+00:00
Hi @Sina Salam

Thank you for your detailed response.

I really appreciate it.

I am afraid your answer might not fully address my issue.

Here is my python code.

try:
    # Initialize the client
    document_client = DocumentIntelligenceClient(
        endpoint=AZURE_DOCUMENT_ENDPOINT,
        credential=AzureKeyCredential(AZURE_DOCUMENT_KEY)
    )
    print("Begin analyzing document using Document Intelligence...")
    # Process the PDF
    poller = document_client.begin_analyze_document(
        "prebuilt-layout",
        body=pdf_content,  # bytes
        content_type="application/pdf",
        output_content_format=DocumentContentFormat.MARKDOWN
    )
    result = poller.result()
except Exception as e:
    raise

As you can see, the code is fairly simple.

The challenge is that while it works for some PDFs, it throws the EOF error or Timeout error for certain others.

I’d like to understand the exact cause of this issue and how I can resolve it.

Any insights would be greatly appreciated.

Thank you!
Sina Salam 18,046 Reputation points

2025-02-18T23:03:04.34+00:00
Hello Hungry Scripter,

Thank you for your feedback.

Good to know that only some PDFs fail, as I said, and that you already have a working implementation that processes some PDFs but fails on others.

If some PDFs work and others fail, the issue is likely due to:

Encryption

Scanned images

Large embedded objects

Corrupt PDF structure

The best solution is to:

Check encryption & decrypt if necessary.

Detect scanned PDFs & apply OCR.

Reduce file size & simplify complex PDFs.

Use asynchronous processing to avoid timeouts.
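Some of the checks in the list above can be approximated with the standard library alone before reaching for heavier tools. The sketch below is heuristic: the function name is made up, the 500 MB default is an assumed size limit that you should replace with the limit for your tier, and looking for the /Encrypt keyword can give false positives.

```python
import os


def preflight_pdf(path, max_mb=500):
    """Cheap, heuristic checks on a PDF before uploading it."""
    size_mb = os.path.getsize(path) / (1024 * 1024)
    with open(path, "rb") as f:
        data = f.read()
    return {
        "size_mb": round(size_mb, 2),
        "too_large": size_mb > max_mb,
        # A well-formed PDF starts with %PDF- and ends with %%EOF
        "has_pdf_header": b"%PDF-" in data[:1024],
        "has_eof_marker": b"%%EOF" in data[-1024:],
        # Encrypted PDFs reference an /Encrypt dictionary in the trailer
        "looks_encrypted": b"/Encrypt" in data,
    }
```

Comparing the report for a PDF that works with the report for one that fails is a quick way to see whether size, truncation, or encryption separates the two groups.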

Look through these, and let me know if you need detailed steps or code for any of them. I will be glad to help.

Success.
Hungry Scripter 0 Reputation points

2025-02-19T00:33:07.94+00:00

Hi Sina Salam

Thank you for your quick and detailed response!

The PDF file is 31MB and contains some scanned images.

I appreciate your insights, and I'll try your suggestions to see if they resolve the issue.

If possible, could you share some code examples for these steps? That would be really helpful.

Thanks again!

Saideep Anchuri 2,370 Microsoft Vendor

Hello Hungry Scripter,

Here are the steps:

You can use the PyMuPDF library (also known as fitz) to extract text from a PDF:

import fitz  # PyMuPDF

def extract_text_from_pdf(pdf_path):
    document = fitz.open(pdf_path)
    text = ""
    for page_num in range(document.page_count):
        page = document.load_page(page_num)
        text += page.get_text("text")
    return text

pdf_path = "path/to/your/pdf_file.pdf"
pdf_text = extract_text_from_pdf(pdf_path)
print(pdf_text)

If you need to shrink the PDF, you can use the PyMuPDF library to rewrite it with unused objects removed and streams compressed. This is lossless, so image quality is unchanged (Document.save has no quality parameter):

import fitz

def compress_pdf(pdf_path, output_path):
    document = fitz.open(pdf_path)
    # garbage=4 drops unused and duplicate objects; deflate=True compresses uncompressed streams
    document.save(output_path, garbage=4, deflate=True)
    document.close()

pdf_path = "path/to/your/pdf_file.pdf"
output_path = "path/to/compressed_pdf.pdf"
compress_pdf(pdf_path, output_path)

If your PDF contains scanned images, you can use the pytesseract library for OCR:

import fitz
import pytesseract
from PIL import Image
import io

def extract_text_from_scanned_pdf(pdf_path):
    document = fitz.open(pdf_path)
    text = ""
    for page_num in range(document.page_count):
        page = document.load_page(page_num)
        pix = page.get_pixmap(dpi=300)  # render at 300 dpi; the 72 dpi default is usually too coarse for OCR
        img = Image.open(io.BytesIO(pix.tobytes()))
        text += pytesseract.image_to_string(img)
    return text

pdf_path = "path/to/your/pdf_file.pdf"
scanned_text = extract_text_from_scanned_pdf(pdf_path)
print(scanned_text)

Kindly refer to the link below: https://github.com/tesseract-ocr/tesseract

Thank You.

Sina Salam 18,046

Hello Hungry Scripter,

Thank you for your feedback.

Below is the code, as requested:

# Check if the PDF is Encrypted
from PyPDF2 import PdfReader
with open("document.pdf", "rb") as f:
    reader = PdfReader(f)
    if reader.is_encrypted:
        print("PDF is encrypted. Decrypting...")
        reader.decrypt("")  # Try empty password or actual password
    print(f"PDF has {len(reader.pages)} pages.")
# If encrypted, try decrypting before sending to Azure.

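# Check if the PDF is Scanned (Non-Selectable Text)
# Some PDFs are scanned images, and Azure Document Intelligence models may
# not handle them well.
# A scanned PDF has no text layer, so text extraction returns nothing.
# (Rendering pages with pdf2image cannot tell the two apart, because it
# produces images for every PDF.)

from PyPDF2 import PdfReader
reader = PdfReader("document.pdf")
has_text = any((page.extract_text() or "").strip() for page in reader.pages)
if not has_text:
    print("This is a scanned PDF. Consider OCR preprocessing.")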

# If it's a scanned PDF, preprocess it with OCR before sending it.
# Optimize PDF Size and Format to reduce PDF File Size
#Using bash: gs -o optimized.pdf -sDEVICE=pdfwrite -dPDFSETTINGS=/screen document.pdf
# This creates a smaller, more efficient PDF.

# Rewrite the PDF Structure with qpdf
# Using bash: qpdf --linearize document.pdf --replace-input
# This rewrites the file with a fresh cross-reference table and a
# streaming-friendly layout. It does not remove embedded objects.

# 3. Test the PDF with Azure Document Intelligence

import requests
url = "https://<your-endpoint>/documentintelligence/documentModels/prebuilt-layout:analyze?api-version=2024-02-29-preview"
headers = {
    "Ocp-Apim-Subscription-Key": "<your-key>",
    "Content-Type": "application/pdf"
}
with open("optimized.pdf", "rb") as f:
    response = requests.post(url, headers=headers, data=f, timeout=120)
# 202 Accepted means the upload worked; the result is fetched from Operation-Location
print(response.status_code, response.headers.get("Operation-Location"), response.text)

# If this succeeds, the issue was due to PDF formatting

# 4. Use the Asynchronous API for Large Documents
poller = document_client.begin_analyze_document(
    "prebuilt-layout",
    body=pdf_content,
    content_type="application/pdf",
    output_content_format=DocumentContentFormat.MARKDOWN
)
result = poller.result()  # Poll until processing completes

NOTE: Read the comments for the actions to take; the steps above should reduce timeouts.
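If you want an upper bound on how long the client waits for the analysis, one option is to poll in short slices instead of calling result() once. The sketch below assumes the poller offers done(), wait(timeout) and result(), as azure-core's LROPoller does; the helper name and the default limits are illustrative.

```python
import time


def wait_for_result(poller, max_wait=600.0, interval=5.0, clock=time.monotonic):
    """Wait for a long-running operation, giving up after max_wait seconds."""
    deadline = clock() + max_wait
    while not poller.done():
        remaining = deadline - clock()
        if remaining <= 0:
            raise TimeoutError(f"analysis still running after {max_wait} seconds")
        # Block for one short slice, never past the deadline
        poller.wait(min(interval, remaining))
    return poller.result()
```

With the code from the question this would be called as result = wait_for_result(poller, max_wait=900), so a slow document ends in a clear TimeoutError after a known limit.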

Below is an optimized version of the above, assisted by myAI Bot:

from PyPDF2 import PdfReader
from pdf2image import convert_from_path
import pytesseract
import requests
import subprocess
PDF_PATH = "document.pdf"
OPTIMIZED_PDF_PATH = "optimized.pdf"
AZURE_ENDPOINT = "https://<your-endpoint>/documentintelligence/documentModels/prebuilt-layout:analyze?api-version=2024-02-29-preview"
API_KEY = "<your-key>"
# Step 1: Check if PDF is encrypted
def is_pdf_encrypted(pdf_path):
    with open(pdf_path, "rb") as f:
        reader = PdfReader(f)
        return reader.is_encrypted
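# Step 2: Check if PDF is scanned (image-based): no page has extractable text
def is_pdf_scanned(pdf_path):
    with open(pdf_path, "rb") as f:
        reader = PdfReader(f)
        return not any((page.extract_text() or "").strip() for page in reader.pages)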
# Step 3: Compress & optimize PDF
def compress_pdf(input_pdf, output_pdf):
    subprocess.run(["gs", "-o", output_pdf, "-sDEVICE=pdfwrite", "-dPDFSETTINGS=/screen", input_pdf], check=True)  # raise if Ghostscript fails
# Step 4: Extract text from scanned PDF using OCR
def extract_text_with_ocr(pdf_path):
    images = convert_from_path(pdf_path)
    text = ""
    for img in images:
        text += pytesseract.image_to_string(img)
    return text
# Step 5: Upload PDF to Azure Document Intelligence
def upload_to_azure(pdf_path):
    headers = {
        "Ocp-Apim-Subscription-Key": API_KEY,
        "Content-Type": "application/pdf"
    }
    with open(pdf_path, "rb") as f:
        response = requests.post(AZURE_ENDPOINT, headers=headers, data=f, timeout=120)
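    # On 202 Accepted the body is empty; the URL to poll is in Operation-Location
    return response.status_code, response.headers.get("Operation-Location", response.text)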
# Execution
if is_pdf_encrypted(PDF_PATH):
    print("PDF is encrypted. Please decrypt before processing.")
elif is_pdf_scanned(PDF_PATH):
    print("Scanned PDF detected. Running OCR...")
    extracted_text = extract_text_with_ocr(PDF_PATH)
    print("Extracted Text:", extracted_text)
else:
    print("Compressing PDF...")
    compress_pdf(PDF_PATH, OPTIMIZED_PDF_PATH)
    print("Uploading optimized PDF to Azure...")
    status, response = upload_to_azure(OPTIMIZED_PDF_PATH)
    print(f"Azure Response: {status} - {response}")

Success.
