EOF Occurred in Violation of Protocol / Timeout Error in Azure Document Intelligence for Specific PDFs

Hungry Scripter 0 Reputation points
2025-02-14T21:04:43.92+00:00

I'm encountering issues when using Azure Document Intelligence to parse certain PDF files. The errors received are:

  1. EOF occurred in violation of protocol (_ssl.c:2417)
  2. Timeout error

These errors only occur with specific PDF files, while others process successfully.

Has anyone experienced similar issues?

Could this be due to PDF encoding, encryption, or specific formatting?

Any troubleshooting tips would be greatly appreciated!

Azure AI Document Intelligence
An Azure service that turns documents into usable data. Previously known as Azure Form Recognizer.
Azure Startups
Azure: A cloud computing platform and infrastructure for building, deploying and managing applications and services through a worldwide network of Microsoft-managed datacenters. Startups: Companies that are in their initial stages of business and typically developing a business model and seeking financing.

1 answer

  1. Sina Salam 18,046 Reputation points
    2025-02-15T17:23:46.09+00:00

    Hello Hungry Scripter,

    Welcome to the Microsoft Q&A and thank you for posting your questions here.

    I understand that you are getting EOF occurred in violation of protocol and timeout errors in Azure Document Intelligence for specific PDFs.

    The EOF occurred in violation of protocol (_ssl.c:2417) error means the connection was closed during the SSL/TLS exchange, which points to an issue at the SSL/TLS or network level rather than a problem in the PDF itself. The timeout error could be due to large or complex PDFs, but could also stem from API request issues, network instability, or incorrect headers.

    To resolve both errors, follow these steps:

    Step 1: The error EOF occurred in violation of protocol (_ssl.c:2417) is usually related to:

    • SSL/TLS misconfigurations
    • Expired/mismatched certificates
    • Network proxy/firewall interference

    Try the following:

    1. Test the connection to Azure Document Intelligence with this bash command (quote the URL so the shell does not interpret the ? or the angle brackets): curl -v "https://<your-endpoint>/documentintelligence/documentModels/prebuilt-layout:analyze?api-version=2024-02-29-preview"

    Any HTTP response, even an error status, means the TLS handshake succeeded. If you see SSL_ERROR_SYSCALL or a similar error instead, check your SSL/TLS settings.

    2. Make sure your client can negotiate TLS 1.2 or 1.3 (depending on API requirements). An outdated OpenSSL build is a common cause, so check which version Python is linked against:
         import ssl
         print(ssl.OPENSSL_VERSION)  # TLS 1.2 needs OpenSSL 1.0.1 or later; update if older

    3. If you are behind a corporate proxy or firewall, ensure it allows https://<your-azure-endpoint>.
    4. If the API key is old or corrupted, generate a new one and retry.
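The OpenSSL version check above only reports what is installed. As a minimal sketch of actually exercising the handshake from Python, the standard library can open a TLS connection that refuses anything below TLS 1.2 and report what was negotiated. The function names `build_tls_context` and `probe_tls` are illustrative, and the host you pass in is your own endpoint's host name, which is not given in this thread.

```python
import socket
import ssl


def build_tls_context() -> ssl.SSLContext:
    """Default client context (certificate and hostname checks on), TLS 1.2+ only."""
    context = ssl.create_default_context()
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


def probe_tls(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Open a TLS connection to host and return the negotiated protocol version."""
    context = build_tls_context()
    with socket.create_connection((host, port), timeout=timeout) as raw_sock:
        with context.wrap_socket(raw_sock, server_hostname=host) as tls_sock:
            return tls_sock.version() or "unknown"


# Example (needs network access; replace the placeholder with your endpoint host):
# print(probe_tls("<your-endpoint>"))
```

If `probe_tls` raises `ssl.SSLError` or `OSError` for a plain connection like this, the problem is likely in the network path (proxy, firewall, TLS inspection) rather than in any particular PDF.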

    Step 2: The timeout error can occur due to:

    • Large/complex PDFs > Use the async API.
    • Slow API response > Increase the client timeout setting.
    • Network instability > Retry with exponential backoff.
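The last bullet, retrying with exponential backoff, could look roughly like the sketch below. It is a generic helper, not part of any Azure SDK; `retry_with_backoff` and its parameters are hypothetical names, and the default exception types are Python built-ins.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=1.0, max_delay=30.0,
                       retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is reached."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error
            # Upper bound doubles each attempt (1s, 2s, 4s, ...), capped at max_delay
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: sleep a random fraction so clients do not retry in lockstep
            sleep(random.uniform(0, delay))
```

When the call inside `operation` uses the requests library, pass `retry_on=(requests.exceptions.ConnectionError, requests.exceptions.Timeout)`, because requests raises its own exception classes rather than the built-in ones.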

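    You can resolve it as follows:

    1. Use the asynchronous pattern for large PDFs: submit the document, then poll for the result instead of waiting on a single request. The REST analyze call returns 202 Accepted with an Operation-Location header that points to the result:
       import requests

       url = "https://<your-endpoint>/documentintelligence/documentModels/prebuilt-layout:analyze?api-version=2024-02-29-preview"
       headers = {
           "Ocp-Apim-Subscription-Key": "<your-key>",
           "Content-Type": "application/pdf"
       }
       # Submit the PDF; the timeout covers the upload and the initial response only
       with open("document.pdf", "rb") as f:
           response = requests.post(url, headers=headers, data=f, timeout=120)
       print(response.status_code, response.text)
       # On success (202), poll this URL with GET until the status is "succeeded" or "failed"
       print(response.headers.get("Operation-Location"))

    2. If using Python, set the client timeout explicitly. Requests has no default timeout and waits indefinitely unless you pass one, for example 120s:
    response = requests.post(url, headers=headers, data=f, timeout=120)

    3. If documents exceed the size limit for your tier (4 MB on the free F0 tier, 500 MB on the standard S0 tier) or the page limit, split large PDFs before uploading:
    pdftk input.pdf burst output page_%02d.pdf

    4. Convert scanned PDFs to a cleaner, smaller format using this bash command (the /screen preset downsamples images heavily, so try /ebook if recognition quality drops): gs -o reduced.pdf -sDEVICE=pdfwrite -dPDFSETTINGS=/screen input.pdf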
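The submit-and-poll approach from the first item needs a polling loop once the document has been submitted. The sketch below is one way to write it, assuming the status values notStarted, running, succeeded and failed that the service reports for analyze operations. `poll_until_done` and `fetch_status` are hypothetical names: `fetch_status` is a function you supply that GETs the Operation-Location URL and returns the parsed JSON body.

```python
import time


def poll_until_done(fetch_status, interval=2.0, max_wait=300.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status() until the operation reaches a terminal state."""
    deadline = clock() + max_wait
    while True:
        body = fetch_status()
        status = body.get("status")
        if status == "succeeded":
            return body
        if status == "failed":
            raise RuntimeError(f"Analysis failed: {body.get('error')}")
        if clock() >= deadline:
            raise TimeoutError(f"Analysis still '{status}' after {max_wait}s")
        sleep(interval)  # Wait before asking again to stay under rate limits


# Example with requests (operation_url comes from the Operation-Location header):
# fetch = lambda: requests.get(
#     operation_url,
#     headers={"Ocp-Apim-Subscription-Key": "<your-key>"},
#     timeout=30,
# ).json()
# result = poll_until_done(fetch)
```

Each poll is a short request with its own small timeout, so a slow analysis no longer depends on one long-lived connection staying open.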

    Step 3: Validate PDF compatibility with Azure. To be sure the PDF is readable before sending it to Azure, test whether it can be opened with Python:

     import PyPDF2

     with open("document.pdf", "rb") as f:
          reader = PyPDF2.PdfReader(f)
          print(reader.is_encrypted)  # True means the file needs decrypting first
          print(len(reader.pages))  # Raises an error if the file is unreadable
    

    If the PDF fails, reprocess it:

      qpdf --decrypt input.pdf output.pdf
      gs -o fixed.pdf -sDEVICE=pdfwrite input.pdf
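Before running the checks and repairs above, a cheaper pre-flight test with only the standard library can catch truncated or oversized files. This is a rough sketch, not a full PDF validator: `preflight_pdf` is a hypothetical helper, and the 4 MB default is just the free-tier figure mentioned earlier, so adjust it to your tier's limit.

```python
import os


def preflight_pdf(path, max_bytes=4 * 1024 * 1024):
    """Return a list of problems found; an empty list means the basic checks passed."""
    size = os.path.getsize(path)
    if size == 0:
        return ["file is empty"]
    problems = []
    if size > max_bytes:
        problems.append(f"file is {size} bytes, above the assumed limit of {max_bytes}")
    with open(path, "rb") as f:
        head = f.read(1024)          # PDF readers look for the header near the start
        f.seek(max(0, size - 1024))
        tail = f.read()              # ...and for the end-of-file marker near the end
    if b"%PDF-" not in head:
        problems.append("missing %PDF- header")
    if b"%%EOF" not in tail:
        problems.append("missing %%EOF trailer (file may be truncated)")
    return problems
```

A file that passes can still be malformed internally, so this complements the PyPDF2 check rather than replacing it.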
    

    Step 4: If errors persist, check for service-side issues. Review recent activity-log events for your resource group with this bash command: az monitor activity-log list --resource-group <your-rg>

    or visit the Azure Status page - https://status.azure.com

    Also, if you are processing many PDFs, monitor API rate limits (429 Too Many Requests errors). If nothing above helps, contact Azure Support with detailed information by raising a ticket in the Azure Portal.
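For the 429 case specifically, a client can honor the server's Retry-After header when one is present and fall back to a doubling delay otherwise. The sketch below assumes the header carries a number of seconds; `delay_for_429` is a hypothetical helper, and `headers` is any mapping of response headers.

```python
def delay_for_429(headers, attempt, base_delay=1.0, max_delay=60.0):
    """Seconds to wait before retrying a 429 response."""
    retry_after = headers.get("Retry-After")
    if retry_after is not None:
        try:
            # Trust the server's hint, clamped to a sane range
            return min(max_delay, max(0.0, float(retry_after)))
        except ValueError:
            pass  # Header held an HTTP date or junk; fall back to backoff
    # No usable hint: double the delay on each attempt, capped at max_delay
    return min(max_delay, base_delay * (2 ** attempt))
```

With requests, `response.headers` is case-insensitive, so it can be passed in directly; a plain dict must use the exact key "Retry-After".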

    I hope this is helpful! Do not hesitate to let me know if you have any other questions or clarifications.


    Please don't forget to close the thread by upvoting and accepting this as the answer if it is helpful.

