How to batch create a searchable PDF using the Azure Document Intelligence Python API

Andrew Richardson (W) 0 Reputation points
2025-01-31T11:54:55.96+00:00

Hi All,

I have stored some PDFs in Azure Blob Storage and I am trying to batch OCR these documents while also creating a searchable PDF. Unfortunately, at the moment I'm getting a file named "filename.pdf.ocr.json", which won't open even if I rename it to the correct file extension. I assume I'm passing the wrong arguments to "document_intelligence_client.begin_analyze_batch_documents()".

Any suggestions on how to batch create a searchable PDF using the Azure Document Intelligence Python API would be greatly appreciated. The arguments I have set for this call are below:

    poller = document_intelligence_client.begin_analyze_batch_documents(
        model_id="prebuilt-read",
        body=request,
        output=["pdf"],
    )
Azure AI Document Intelligence
An Azure service that turns documents into usable data. Previously known as Azure Form Recognizer.

1 answer

  1. Vikram Singh 1,955 Reputation points Microsoft Employee
    2025-02-07T04:27:46.95+00:00

    Hello Andrew Richardson (W),

    Thanks for troubleshooting this; I understand your frustration.

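    Based on the Azure SDK documentation for Document Intelligence, the AnalyzeBatchDocumentsRequest class does not have an output_content_format field, and that parameter would not produce a PDF in any case:

    1. The begin_analyze_batch_documents method in the DocumentIntelligenceClient class accepts an AnalyzeBatchDocumentsRequest object that describes the source and result containers. In your case the batch run wrote its result as JSON ("filename.pdf.ocr.json"). That file is analysis output rather than a PDF, so renaming it will not make it open.
    2. The output_content_format parameter of begin_analyze_document only selects how the extracted content is formatted (text or markdown). A searchable PDF is requested with output=[AnalyzeOutputOption.PDF] on begin_analyze_document and then downloaded with get_analyze_result_pdf, which suggests that you should use this method per document instead.

    Here is an example of how to use the begin_analyze_document method to get a searchable PDF for one document. It assumes azure-ai-documentintelligence 1.0.0 or later, where the request is passed as body (as in your own call), and the endpoint, key, and SAS URL are placeholders:

    from azure.ai.documentintelligence import DocumentIntelligenceClient
    from azure.ai.documentintelligence.models import AnalyzeDocumentRequest, AnalyzeOutputOption
    from azure.core.credentials import AzureKeyCredential
    
    # Placeholders: replace with your own endpoint, key, and the SAS URL of one PDF blob
    endpoint = "<your-endpoint>"
    key = "<your-key>"
    document_sas_url = "<sas-url-of-one-pdf-blob>"
    
    client = DocumentIntelligenceClient(endpoint=endpoint, credential=AzureKeyCredential(key))
    
    # Make the request and ask the service to also generate a searchable PDF
    poller = client.begin_analyze_document(
        model_id="prebuilt-read",
        body=AnalyzeDocumentRequest(url_source=document_sas_url),
        output=[AnalyzeOutputOption.PDF],
    )
    
    # Wait for the analysis to finish
    result = poller.result()
    
    # Download the searchable PDF generated for this operation
    pdf_chunks = client.get_analyze_result_pdf(
        model_id=result.model_id,
        result_id=poller.details["operation_id"],
    )
    
    # Save the result as a PDF
    with open("output.pdf", "wb") as f:
        f.writelines(pdf_chunks)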
    

    For more detailed information, you can refer to the official Microsoft documentation on the DocumentIntelligenceClient class and its methods.

    I hope this helps! Let me know if you have any other questions.

    If the reply was helpful, please don't forget to upvote and/or accept it as the answer; this can be beneficial to other community members.

    Thanks
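
Since the question is about a whole container of PDFs, the per-document call can be wrapped in a loop. The following is only a minimal sketch, not an official batch API: the helper names blob_sas_url and make_searchable_pdfs are made up for illustration, client is assumed to be a DocumentIntelligenceClient (azure-ai-documentintelligence 1.0.0 or later), the container SAS URL is assumed to grant read access to the blobs, and the request body is passed as a plain dict with the REST-style "urlSource" key, which the SDK is assumed to accept in place of AnalyzeDocumentRequest.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def blob_sas_url(container_sas_url: str, blob_name: str) -> str:
    """Build a blob URL from a container SAS URL by appending the blob name to the path."""
    parts = urlsplit(container_sas_url)
    # quote() keeps "/" so virtual folders in the blob name are preserved
    path = parts.path.rstrip("/") + "/" + quote(blob_name)
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


def make_searchable_pdfs(client, container_sas_url, blob_names, model_id="prebuilt-read"):
    """Analyze each blob in turn and return {blob_name: searchable PDF bytes}.

    `client` is assumed to behave like DocumentIntelligenceClient.
    """
    pdfs = {}
    for name in blob_names:
        # Request OCR plus a searchable PDF for this one document
        poller = client.begin_analyze_document(
            model_id,
            body={"urlSource": blob_sas_url(container_sas_url, name)},
            output=["pdf"],
        )
        poller.result()  # block until the analysis has finished
        # Fetch the PDF that belongs to this analyze operation
        chunks = client.get_analyze_result_pdf(
            model_id=model_id,
            result_id=poller.details["operation_id"],
        )
        pdfs[name] = b"".join(chunks)
    return pdfs
```

The blob names could come from any listing of the container, for example ContainerClient.from_container_url(...).list_blobs() in the azure-storage-blob package, and each returned bytes value can be written to disk or uploaded back to Blob Storage. The documents are processed one after another here, so a large container may take a while.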

