Hello Hungry Scripter,
Welcome to the Microsoft Q&A and thank you for posting your questions here.
I understand that you are getting an "EOF occurred in violation of protocol" error and timeout errors from Azure Document Intelligence for specific PDFs.
The error "EOF occurred in violation of protocol (_ssl.c:2417)" points to a problem at the SSL/TLS level rather than in the PDF itself. The timeout error may be caused by large or complex PDFs, as you mentioned, but it can also come from API request issues, network instability, or incorrect headers.
Therefore, to resolve the EOF error and timeout issue effectively, follow these steps:
Step 1: The error EOF occurred in violation of protocol (_ssl.c:2417) is usually related to:
- SSL/TLS misconfigurations
- Expired/mismatched certificates
- Network proxy/firewall interference
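Try the following:
- Test the connection to Azure Document Intelligence with curl from a shell (quote the URL so the shell does not interpret the ? character):
curl -v "https://<your-endpoint>/formrecognizer/documentModels/prebuilt-layout:analyze?api-version=2024-02-29-preview"
If you see SSL_ERROR_SYSCALL or a similar error, check your SSL/TLS settings. An HTTP error status is expected here because the request carries no key or body; only the TLS handshake matters for this test.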
- Make sure your client negotiates TLS 1.2 or later, which Azure endpoints require. An outdated OpenSSL build is a common reason it cannot, so check which version your Python is linked against:
import ssl
print(ssl.OPENSSL_VERSION)  # OpenSSL 1.1.1 or newer is needed for TLS 1.3
- If you are behind a corporate proxy or firewall, make sure it allows traffic to https://<your-azure-endpoint>.
- If the API key is old or corrupted, generate a new one and retry.
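To go one step beyond printing the OpenSSL version, here is a minimal sketch that uses only the Python standard library to open a TLS connection with a floor of TLS 1.2 and report which protocol was negotiated. The function names build_tls_context and probe_tls are my own, and the host in the usage comment is a placeholder, so treat this as a diagnostic aid under those assumptions rather than a fix.

```python
import socket
import ssl
from typing import Optional


def build_tls_context() -> ssl.SSLContext:
    """Return a certificate-verifying context that refuses anything older than TLS 1.2."""
    context = ssl.create_default_context()
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


def probe_tls(host: str, port: int = 443, timeout: float = 10.0) -> Optional[str]:
    """Open a TLS connection and return the negotiated protocol, e.g. 'TLSv1.3'."""
    context = build_tls_context()
    with socket.create_connection((host, port), timeout=timeout) as raw_socket:
        with context.wrap_socket(raw_socket, server_hostname=host) as tls_socket:
            return tls_socket.version()


# Usage (placeholder host, replace with the host part of your own endpoint):
#   print(probe_tls("<your-endpoint-host>"))
```

If probe_tls raises ssl.SSLError or the connection is reset from the same machine that shows the EOF error, the problem is likely in the network path (proxy, firewall, TLS inspection) rather than in your Document Intelligence request.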
Step 2: Timeout errors usually have one of these causes, each with its own remedy:
- Large/complex PDFs > Use the async API.
- Slow API response > Increase the client timeout setting.
- Network instability > Retry with exponential backoff.
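For the exponential backoff mentioned in the last item, a small generic helper could look like the sketch below. The name retry_with_backoff, its parameters, and the default set of retried exceptions are assumptions of mine; adjust them to whatever your HTTP client raises (for example requests.exceptions.ConnectionError).

```python
import random
import time


def retry_with_backoff(func, max_attempts=5, base_delay=1.0, max_delay=30.0,
                       retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call func() until it succeeds, doubling the wait after each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Up to 10% random jitter so parallel clients do not retry in lockstep
            sleep(delay + random.uniform(0, delay * 0.1))


# Usage (send_document is a hypothetical function that performs your request):
#   result = retry_with_backoff(send_document)
```

The sleep parameter is injectable only so the helper can be exercised without really waiting; in normal use you can leave it at the default.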
You can address them as follows:
- Switch to the asynchronous API for large PDFs: submit the document, then poll for the result instead of waiting synchronously:
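import requests
url = "https://<your-endpoint>/formrecognizer/documentModels/prebuilt-layout:analyze?api-version=2024-02-29-preview"
headers = {
    "Ocp-Apim-Subscription-Key": "<your-key>",
    "Content-Type": "application/pdf",
}
# Submit the document; a 202 response means the analysis was accepted
with open("document.pdf", "rb") as f:
    response = requests.post(url, headers=headers, data=f, timeout=120)
print(response.status_code, response.text)
# On success, the URL to poll for the result is in the Operation-Location header
print(response.headers.get("Operation-Location"))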
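The submit call only starts the analysis; the result has to be fetched by polling the operation URL returned in the Operation-Location response header. Below is a minimal sketch of that polling loop. The helper name poll_until_done and the fetch_status callable are my own constructs, and I am assuming the operation body reports a status of notStarted, running, succeeded, or failed.

```python
import time


def poll_until_done(fetch_status, interval=2.0, max_wait=300.0, sleep=time.sleep):
    """Call fetch_status() until the operation succeeds, fails, or max_wait elapses.

    fetch_status must return the operation's JSON body as a dict.
    """
    waited = 0.0
    while True:
        body = fetch_status()
        status = body.get("status")
        if status == "succeeded":
            return body
        if status == "failed":
            raise RuntimeError(f"Analysis failed: {body.get('error')}")
        if waited >= max_wait:
            raise TimeoutError(f"Still '{status}' after {waited:.0f}s of polling")
        sleep(interval)
        waited += interval


# Usage with requests (operation_url comes from response.headers["Operation-Location"]):
#   result = poll_until_done(
#       lambda: requests.get(
#           operation_url,
#           headers={"Ocp-Apim-Subscription-Key": "<your-key>"},
#           timeout=30,
#       ).json()
#   )
```

Each polling request is short, so a large document no longer depends on one long-lived connection staying open.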
- If you are using Python, set an explicit client timeout. The requests library has no default timeout and waits indefinitely unless you pass one, so choose a generous value such as 120 seconds:
response = requests.post(url, headers=headers, data=f, timeout=120)
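- If a document exceeds the limits of your pricing tier, split it before uploading. Check the current service limits for your resource; the documentation lists 4 MB per file on the free (F0) tier and 500 MB on the standard (S0) tier, with up to 2,000 pages per document: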
pdftk input.pdf burst output page_%02d.pdf
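- Reduce the size of scanned PDFs with Ghostscript. Note that the /screen preset downsamples images aggressively, which can hurt OCR accuracy; /ebook is a gentler option: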
gs -o reduced.pdf -sDEVICE=pdfwrite -dPDFSETTINGS=/screen input.pdf
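pdftk burst writes one file per page, which can mean a lot of requests. If you would rather split into multi-page chunks from Python, here is a sketch. The helpers page_chunks and split_pdf are names I made up, and split_pdf assumes PyPDF2 2.0 or later (PdfReader, PdfWriter.add_page).

```python
def page_chunks(total_pages, chunk_size):
    """Return (start, stop) page index pairs covering total_pages, stop exclusive."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [(start, min(start + chunk_size, total_pages))
            for start in range(0, total_pages, chunk_size)]


def split_pdf(path, chunk_size, out_prefix="part"):
    """Write the PDF at path as several files of at most chunk_size pages each."""
    from PyPDF2 import PdfReader, PdfWriter  # assumed: PyPDF2 >= 2.0 is installed

    reader = PdfReader(path)
    outputs = []
    for index, (start, stop) in enumerate(page_chunks(len(reader.pages), chunk_size), 1):
        writer = PdfWriter()
        for page_number in range(start, stop):
            writer.add_page(reader.pages[page_number])
        out_path = f"{out_prefix}_{index:03d}.pdf"
        with open(out_path, "wb") as out_file:
            writer.write(out_file)
        outputs.append(out_path)
    return outputs


# Usage: split_pdf("input.pdf", chunk_size=50) writes part_001.pdf, part_002.pdf, ...
```

Pick the chunk size based on what keeps each request comfortably within your timeout; 50 is only an example.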
Step 3: Validate PDF compatibility. To make sure the PDF is readable before sending it to Azure, test whether it opens in Python:
import PyPDF2
with open("document.pdf", "rb") as f:
    reader = PyPDF2.PdfReader(f)
    print(len(reader.pages))  # Check if readable
If the PDF fails to open, remove any encryption with qpdf or rewrite the file with Ghostscript:
qpdf --decrypt input.pdf output.pdf
gs -o fixed.pdf -sDEVICE=pdfwrite input.pdf
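Before parsing the file at all, a cheap sanity check can catch truncated or mislabeled uploads. The sketch below uses only the standard library; quick_pdf_check is a name I made up, and it only looks for the %PDF- header, the %%EOF marker, and an optional size cap, so it does not replace the PyPDF2 test.

```python
import os


def quick_pdf_check(path, max_bytes=None):
    """Return a list of problems found; an empty list means the basic checks passed."""
    size = os.path.getsize(path)
    if size == 0:
        return ["file is empty"]
    problems = []
    if max_bytes is not None and size > max_bytes:
        problems.append(f"file is {size} bytes, above the {max_bytes} byte limit")
    with open(path, "rb") as pdf_file:
        head = pdf_file.read(1024)
        pdf_file.seek(max(0, size - 1024))
        tail = pdf_file.read()
    if b"%PDF-" not in head:
        problems.append("missing %PDF- header (not a PDF?)")
    if b"%%EOF" not in tail:
        problems.append("missing %%EOF marker (file may be truncated)")
    return problems


# Usage: print(quick_pdf_check("document.pdf", max_bytes=4 * 1024 * 1024))
```

A file reported as truncated here is worth downloading or exporting again before you spend time on the API side.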
Step 4: If errors persist, check for service-side issues. Review Azure Service Health by running this command:
az monitor activity-log list --resource-group <your-rg>
or visit the Azure Status page: https://status.azure.com
Also, if you are processing many PDFs, monitor API rate limits (429 Too Many Requests errors). If the problem continues, contact Azure Support with detailed information by raising a ticket in the Azure Portal.
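When you do hit 429 responses, it is usually better to wait for as long as the server asks than to guess. The sketch below parses a Retry-After header, which can be either a number of seconds or an HTTP date. I am assuming the 429 response carries that header; the function name retry_after_seconds and the fallback default are mine.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(headers, default=1.0, now=None):
    """Return how many seconds to wait according to Retry-After, else default."""
    value = headers.get("Retry-After")
    if value is None:
        return default
    value = value.strip()
    if value.isdigit():
        return float(value)  # delay given directly in seconds
    try:
        when = parsedate_to_datetime(value)  # delay given as an HTTP date
    except (TypeError, ValueError):
        return default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


# Usage with requests (response.headers is case-insensitive):
#   if response.status_code == 429:
#       time.sleep(retry_after_seconds(response.headers))
```

If the header is missing, fall back to an exponential backoff instead of a fixed delay.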
I hope this is helpful! Do not hesitate to let me know if you have any other questions or clarifications.
Please don't forget to close the thread by upvoting and accepting this as the answer if it is helpful.