How to batch create a searchable PDF using the Azure Document Intelligence Python API

Andrew Richardson (W) 0 Reputation points
2025-01-31T11:54:55.96+00:00

Hi All,

I have stored some PDFs in Azure Blob Storage and I am trying to batch OCR these documents while also creating a searchable PDF. Unfortunately, at the moment I'm getting a file named "filename.pdf.ocr.json", which won't open even if I rename it with the correct file extension. I assume I'm passing the wrong arguments to "document_intelligence_client.begin_analyze_batch_documents()"; the arguments I have set for it are below:

Any suggestions on how to batch create a searchable PDF using the Azure Document Intelligence Python API would be greatly appreciated.

    poller = document_intelligence_client.begin_analyze_batch_documents(
        model_id="prebuilt-read",
        body=request,
        output=["pdf"],
    )
Azure AI Document Intelligence

1 answer

  1. Vikram Singh 1,070 Reputation points Microsoft Employee
    2025-02-07T04:27:46.95+00:00

    Hello Andrew Richardson (W),

    Thanks for troubleshooting this; I understand your frustration.

    Based on the Azure SDK documentation for Document Intelligence, the AnalyzeBatchDocumentsRequest class does not support the output_content_format parameter. This is confirmed by the following details:

    1. The begin_analyze_batch_documents method in the DocumentIntelligenceClient class accepts an AnalyzeBatchDocumentsRequest object, but the output_content_format parameter is not listed as a valid argument.
    2. The output_content_format parameter is supported in the begin_analyze_document method, which suggests that you should use this method instead for specifying the output format.

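    Here is an example of how to use the begin_analyze_document method to request a searchable PDF for a single document. Note that output_content_format only controls the format of the extracted text ("text" or "markdown"); the PDF itself is requested with the output option and then downloaded with get_analyze_result_pdf:

    from azure.ai.documentintelligence import DocumentIntelligenceClient
    from azure.ai.documentintelligence.models import AnalyzeDocumentRequest, AnalyzeOutputOption
    from azure.core.credentials import AzureKeyCredential
    
    # Placeholders: replace with your own endpoint, key, and the SAS URL of one PDF blob
    endpoint = "<your-endpoint>"
    key = "<your-key>"
    document_sas_url = "<sas-url-of-one-pdf-blob>"
    
    client = DocumentIntelligenceClient(endpoint=endpoint, credential=AzureKeyCredential(key))
    
    # Analyze a single document and ask the service to also generate a searchable PDF
    poller = client.begin_analyze_document(
        model_id="prebuilt-read",
        body=AnalyzeDocumentRequest(url_source=document_sas_url),
        output=[AnalyzeOutputOption.PDF],
    )
    
    # Wait for the analysis; the result holds the OCR data, not the PDF bytes
    result = poller.result()
    operation_id = poller.details["operation_id"]
    
    # Download the searchable PDF generated for this operation and save it
    response = client.get_analyze_result_pdf(model_id=result.model_id, result_id=operation_id)
    with open("output.pdf", "wb") as f:
        f.writelines(response)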
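
To cover the batch case from the question, one option is to repeat the single-document call for every PDF in the container. The following is only a minimal sketch, assuming azure-ai-documentintelligence 1.0.0 or later, where a searchable PDF is requested with output=["pdf"] and downloaded with get_analyze_result_pdf. The function names save_searchable_pdfs and list_pdf_blobs, the ".searchable.pdf" file naming, and the use of a container SAS URL with read and list permissions are assumptions made for illustration, not part of the SDK.

```python
from pathlib import Path
from typing import Iterable, Iterator, List, Tuple


def list_pdf_blobs(container_sas_url: str) -> Iterator[Tuple[str, str]]:
    """Yield (blob_name, blob_url) for every PDF blob in the container.

    Assumption: the SAS token grants read and list permissions, and the
    blob client's url carries that token so the service can fetch the blob.
    """
    # Imported lazily so the rest of the sketch works without azure-storage-blob
    from azure.storage.blob import ContainerClient

    container = ContainerClient.from_container_url(container_sas_url)
    for blob in container.list_blobs():
        if blob.name.lower().endswith(".pdf"):
            yield blob.name, container.get_blob_client(blob.name).url


def save_searchable_pdfs(
    client, documents: Iterable[Tuple[str, str]], out_dir: str
) -> List[Path]:
    """Analyze each (name, url) pair and save one searchable PDF per document.

    `client` is expected to behave like DocumentIntelligenceClient.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved: List[Path] = []
    for name, url in documents:
        # Request OCR plus a searchable PDF for this single document
        poller = client.begin_analyze_document(
            "prebuilt-read",
            {"urlSource": url},
            output=["pdf"],
        )
        result = poller.result()
        operation_id = poller.details["operation_id"]

        # The PDF is fetched separately, keyed by the operation id
        chunks = client.get_analyze_result_pdf(
            model_id=result.model_id, result_id=operation_id
        )
        # Hypothetical naming scheme; blobs with the same stem would collide
        target = out / (Path(name).stem + ".searchable.pdf")
        with open(target, "wb") as f:
            f.writelines(chunks)
        saved.append(target)
    return saved
```

With a real client this could be called as save_searchable_pdfs(client, list_pdf_blobs(container_sas_url), "searchable_pdfs"). The documents are processed one after another, so for large containers it may be worth adding retries or running several calls concurrently.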
    

    For more detailed information, you can refer to the official Microsoft documentation on the DocumentIntelligenceClient class and its methods.

    I hope this helps! Let me know if you have any other questions.

    If the reply was helpful, please don't forget to upvote and/or accept it as the answer; this can be beneficial to other community members.

    Thanks
