How to batch-create searchable PDFs using the Azure Document Intelligence Python API

Andrew Richardson (W) 0 Reputation points
2025-01-31T11:54:55.96+00:00

Hi All,

I have stored some PDFs in Azure Blob Storage and I am trying to batch OCR these documents while also creating a searchable PDF for each one. Unfortunately, at the moment I'm getting a file named "filename.pdf.ocr.json", which won't open as a PDF even if I rename it with a .pdf extension. I assume I'm passing the wrong arguments to "document_intelligence_client.begin_analyze_batch_documents()".

Any suggestions on how to batch-create searchable PDFs using the Azure Document Intelligence Python API would be greatly appreciated.

The arguments I have set for the call are below:

    poller = document_intelligence_client.begin_analyze_batch_documents(
        model_id="prebuilt-read",
        body=request,
        output=["pdf"],
    )
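A quick way to see why renaming the file does not help is to inspect the first bytes of the downloaded result: a real PDF starts with the marker `%PDF-`, while an `.ocr.json` result is JSON text. The following is a minimal sketch using only the standard library; `sniff_result` is a hypothetical helper name and is not part of the Azure SDK.

```python
import json


def sniff_result(data: bytes) -> str:
    """Classify downloaded result bytes as 'pdf', 'json', or 'unknown'."""
    # Every PDF file begins with the "%PDF-" header.
    if data.startswith(b"%PDF-"):
        return "pdf"
    try:
        # "utf-8-sig" tolerates an optional byte-order mark before the JSON.
        json.loads(data.decode("utf-8-sig"))
        return "json"
    except (UnicodeDecodeError, json.JSONDecodeError):
        return "unknown"
```

Calling it on the bytes of "filename.pdf.ocr.json" should report "json" if the service only wrote the OCR result, which would confirm that no PDF was produced rather than that the extension is wrong.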
Azure AI Document Intelligence

1 answer

  1. Vikram Singh 980 Reputation points Microsoft Employee
    2025-02-05T07:33:41.6766667+00:00

    Hello Andrew Richardson (W),

    Thanks for sharing the code sample. Since you're following the official sample and using AnalyzeBatchDocumentsRequest, the JSON output suggests that the API isn't generating a searchable PDF. To resolve this, could you try the request below, which includes output_content_format="pdf", so that the output is returned as a searchable PDF instead of JSON?

    request = AnalyzeBatchDocumentsRequest(    
        result_container_url=result_container_sas_url,
        azure_blob_source=AzureBlobContentSource(
            container_url=batch_training_data_container_sas_url
        ),
        output_content_format="pdf" # Ensure this is included
    )
    

    Do let me know if you are still facing the issue.

    Thanks.
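For comparison, here is a sketch of the single-document route to a searchable PDF, which could be looped over files as a fallback if the batch call keeps returning only JSON. It assumes the `azure-ai-documentintelligence` 1.0.0 package, where the PDF is requested through `output` on the analyze call and then fetched with `get_analyze_result_pdf`. In that release, `output_content_format` appears to be a keyword of the analyze call that accepts "text" or "markdown", so it is worth checking the suggestion above against your installed version. The function names, the `.searchable.pdf` naming rule, and the key-based credential are illustrative assumptions, and the thread does not confirm whether the batch API writes PDFs to the result container.

```python
from pathlib import Path


def searchable_name(source_pdf: str) -> Path:
    """Hypothetical naming rule: scans/a.pdf -> scans/a.searchable.pdf."""
    source = Path(source_pdf)
    return source.with_name(source.stem + ".searchable.pdf")


def save_searchable_pdf(endpoint: str, key: str, source_pdf: str) -> Path:
    """Analyze one local PDF with prebuilt-read and save the searchable PDF."""
    # Imported here so searchable_name() works without the SDK installed.
    from azure.ai.documentintelligence import DocumentIntelligenceClient
    from azure.ai.documentintelligence.models import AnalyzeOutputOption
    from azure.core.credentials import AzureKeyCredential

    client = DocumentIntelligenceClient(
        endpoint=endpoint, credential=AzureKeyCredential(key)
    )
    with open(source_pdf, "rb") as f:
        poller = client.begin_analyze_document(
            "prebuilt-read",
            body=f,
            output=[AnalyzeOutputOption.PDF],  # ask the service for a searchable PDF
        )
        poller.result()  # block until the analysis has finished

    # Assumption: the poller exposes the operation id under details["operation_id"],
    # as in the SDK's searchable-PDF sample.
    operation_id = poller.details["operation_id"]
    pdf_chunks = client.get_analyze_result_pdf(
        model_id="prebuilt-read", result_id=operation_id
    )

    target = searchable_name(source_pdf)
    with target.open("wb") as writer:
        writer.writelines(pdf_chunks)  # the response is an iterator of byte chunks
    return target
```

To process a container this way, each blob would first need to be downloaded (or streamed) and passed through `save_searchable_pdf` one at a time, which trades the convenience of the batch call for a documented PDF retrieval step.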
