Sporadic "InvalidContent" Error with Form Recognizer Despite Verified Valid PDF Files

Question

Issue Summary:

I'm experiencing a sporadic issue with Azure Form Recognizer, where I receive the following error for some PDF files:


(InvalidRequest) Invalid request.
Code: InvalidRequest
Message: Invalid request.
Inner error: {
    "code": "InvalidContent",
    "message": "The file is corrupted or format is unsupported. Refer to documentation for the list of supported formats."
}

The PDFs that are causing this error have been verified multiple times and appear to be valid. The verification includes:

- Ensuring that the PDFs are accessible via a SAS token.
- Confirming that they are not corrupted and contain readable content.
- Rechecking the partitioned versions of the PDFs, as they are part of a larger original file split into multiple smaller files.
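
For illustration, a structural check of this kind could look like the sketch below. It is not the exact verification used here: the function names are hypothetical, and the check is deliberately shallow. It only confirms that the bytes carry a `%PDF-` header and a `%%EOF` marker, which catches truncated files or HTML error pages served in place of the PDF, but it does not prove that the file parses.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes have a PDF header and an end-of-file marker."""
    # The "%PDF-" header is expected near the start of the file.
    has_header = b"%PDF-" in data[:1024]
    # The "%%EOF" marker is expected near the end of the file.
    has_trailer = b"%%EOF" in data[-1024:]
    return has_header and has_trailer


def check_pdf_file(path: str) -> bool:
    """Read a partitioned PDF from disk and run the structural check."""
    with open(path, "rb") as handle:
        return looks_like_pdf(handle.read())
```

The same `looks_like_pdf` check could also be run on the bytes downloaded from the SAS URL, to confirm that what the service would fetch matches what was written locally.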

Steps Taken to Mitigate the Issue:

- Concurrency Control: Reduced the number of workers to just one to ensure no simultaneous requests are causing overload.
- Retry Logic: Implemented retry logic to reattempt analysis after a delay if the error occurs, but the error persists even after retries.
- Partition Verification: Verified each partitioned PDF file to confirm readability and correctness before sending to Azure.
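
As a point of comparison for the retry step, here is a minimal sketch of a retry helper with exponential backoff instead of a fixed delay. The helper name, the default attempt count, and the injectable `sleep` parameter are assumptions made for illustration; this is not what the code in this post currently does.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func until it succeeds or the attempts are exhausted.

    The wait doubles after each failure: base_delay, 2 * base_delay, ...
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            # Re-raise the last failure so the caller sees the real error.
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
    raise RuntimeError("unreachable")  # the loop always returns or raises
```

With the analysis function from this post, a call might look like `retry_with_backoff(lambda: analyze_document(url, retry_count=0))`, so that the retry policy lives in one place.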

Code Explanation:

The workflow involves several main steps:

- Splitting the Original PDF: The process starts by splitting a larger original PDF into multiple smaller sections. The split is based on specific identifiers found within the text of the document, which determine the boundaries for each section. Each section is saved as an individual PDF.
- Uploading and Analyzing PDFs: Once partitioned, each PDF is uploaded to Azure Blob Storage, and its URL is then used to analyze the content using Azure Form Recognizer. The goal is to extract structured data from each PDF.
- Retry Logic and Error Handling: The code includes logic to retry analyzing a document if an error, such as "InvalidContent," occurs. The retries happen after a short delay to handle potential transient issues.
  - Errors are logged extensively to provide more context on the nature of the failure, such as which section failed and why. This helps in debugging and isolating potential causes.
- JSON Reporting: After processing each PDF:
  - A detailed JSON report is generated for each section, containing metadata like the section identifier, page count, and the extraction status.
  - A general report is also updated to keep track of all processed documents, providing a summary view of what has been processed and whether any errors occurred.
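
To make the reporting step concrete, the sketch below shows roughly what the per-section and general reports could contain. The field names (`section_id`, `page_count`, `status`, `error`) and the `"succeeded"` status value are hypothetical stand-ins, not the exact schema of the real reports.

```python
import json
from typing import Optional


def build_section_report(
    section_id: str,
    page_count: int,
    status: str,
    error: Optional[str] = None,
) -> dict:
    """Build the detailed report for one partitioned PDF."""
    return {
        "section_id": section_id,
        "page_count": page_count,
        "status": status,
        "error": error,
    }


def update_general_report(general: dict, section_report: dict) -> dict:
    """Append a section report and refresh the summary counters."""
    general.setdefault("sections", []).append(section_report)
    general["processed"] = len(general["sections"])
    general["failed"] = sum(
        1 for section in general["sections"] if section["status"] != "succeeded"
    )
    return general


if __name__ == "__main__":
    summary: dict = {}
    update_general_report(summary, build_section_report("A-001", 3, "succeeded"))
    update_general_report(
        summary, build_section_report("A-002", 2, "failed", "InvalidContent")
    )
    print(json.dumps(summary, indent=2))
```

Keeping the error code in each section report makes it easy to compare which sections failed across runs.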

Code Snippet:

Here's a portion of the code that outlines the process:


import os
import json
import time
import concurrent.futures

import requests
from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.exceptions import HttpResponseError

from logging_config import logger

# Configuration for Azure (sensitive values are hidden)
ENDPOINT_AI = os.getenv('ENDPOINT_AI')  # Azure endpoint (hidden)
KEY_AI = os.getenv('KEY_AI')  # Azure key (hidden)
MODEL_ID = "estrattore-bollette-v3"
SAS_TOKEN = os.getenv('SAS_TOKEN')  # SAS token for Azure blob (hidden)

document_analysis_client = DocumentAnalysisClient(
    endpoint=ENDPOINT_AI,
    credential=AzureKeyCredential(KEY_AI)
)

def analyze_document(blob_url: str, retry_count=1):
    try:
        logger.info(f"Starting document analysis for URL: {blob_url}")
        full_url = f"{blob_url}?{SAS_TOKEN}"

        # Check file size before processing
        response = requests.head(full_url)
        file_size = int(response.headers.get('Content-Length', 0))
        max_size = 500 * 1024 * 1024  # 500MB in bytes
        if file_size > max_size:
            logger.error(f"File too large: {file_size / 1024 / 1024:.2f}MB (max 500MB)")
            raise ValueError("File exceeds maximum size")

        # Proceed with analysis
        poller = document_analysis_client.begin_analyze_document_from_url(
            model_id=MODEL_ID,
            document_url=full_url
        )
        result = poller.result()
        if not result:
            raise ValueError("Extraction result is empty")

        logger.info(f"Successfully extracted fields for URL: {blob_url}")
        return result

    except HttpResponseError as e:
        logger.error(f"Azure API error for {blob_url}: {e.message}")
        if retry_count > 0:
            logger.info(f"Retrying extraction for {blob_url} (remaining attempts: {retry_count})")
            time.sleep(2)  # Wait before retrying
            return analyze_document(blob_url, retry_count - 1)
        raise

    except Exception as e:
        logger.error(f"Error during extraction for {blob_url}: {e}")
        if retry_count > 0:
            logger.info(f"Retrying extraction for {blob_url} (remaining attempts: {retry_count})")
            time.sleep(2)
            return analyze_document(blob_url, retry_count - 1)
        raise
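
The handler above logs only `e.message`, which drops the inner `InvalidContent` code. A hedged sketch of how that code might be pulled out for logging follows. It assumes the exception exposes the parsed error body as an `error` attribute with an `innererror` dictionary, which is my understanding of `HttpResponseError` in `azure-core` but should be checked against the installed version. The helper name is hypothetical, and it falls back to `None` when those attributes are missing.

```python
from typing import Optional


def inner_error_code(exc: Exception) -> Optional[str]:
    """Return the inner error code (e.g. "InvalidContent") if one is present."""
    # Assumption: the SDK attaches the parsed error body as `exc.error`.
    error = getattr(exc, "error", None)
    # Assumption: the service's "innererror" object is exposed as a dict.
    inner = getattr(error, "innererror", None)
    if isinstance(inner, dict):
        return inner.get("code")
    return None
```

Inside the `except HttpResponseError as e:` branch, this could be logged as `logger.error(f"Inner code: {inner_error_code(e)}")` to separate `InvalidContent` failures from other request errors.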

Observation:

The error occurs sporadically, and different PDF files fail each time the process is run. After retrying, the previously failing PDFs may pass without any issues, suggesting that the content itself may not be the problem.

Questions:

- Are there any known limitations or settings in Form Recognizer that could cause intermittent "InvalidContent" errors despite verified valid files?
- Could there be other potential reasons why Azure Form Recognizer would intermittently fail on different PDF files in this way?
- Are there any specific recommendations to avoid this error, such as additional preprocessing or formatting requirements for the PDF files?

Thank you for any insights or suggestions you can provide!
