How to batch create a searchable PDF using the Azure Document Intelligence Python API

Andrew Richardson (W) 0

Hi All,

I have stored some PDFs in Azure Blob Storage and I am trying to batch OCR these documents while also creating a searchable PDF. Unfortunately, at the moment I'm getting a file named "filename.pdf.ocr.json", which won't open even if I rename it with the correct file extension. I assume I'm using the wrong Python setup for the function "document_intelligence_client.begin_analyze_batch_documents()"; the arguments I have set for it are below.

Any suggestions on how to batch create a searchable PDF using the Azure Document Intelligence Python API would be greatly appreciated.

    poller = document_intelligence_client.begin_analyze_batch_documents(
        model_id="prebuilt-read",
        body=request,
        output=["pdf"],
    )
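A file ending in ".ocr.json" is JSON text rather than PDF data, which is why renaming it does not produce something a PDF viewer can open. As a minimal sketch for checking what such a file contains, the helper below loads it and returns the operation status plus a short preview of the recognized text. The function name summarize_ocr_json and the assumed layout (a top-level "status" and an "analyzeResult" object with a "content" string) are assumptions for illustration and are not confirmed anywhere in this thread.

```python
import json
from pathlib import Path


def summarize_ocr_json(path):
    """Return the status and a short text preview from a batch result file."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    # Assumption: the OCR text lives under analyzeResult.content.
    # If there is no "analyzeResult" key, fall back to the top-level object.
    analyze_result = data.get("analyzeResult", data)
    content = analyze_result.get("content", "")
    return {"status": data.get("status"), "preview": content[:200]}
```

Calling summarize_ocr_json("filename.pdf.ocr.json") on a downloaded result file should show whether the OCR itself succeeded, independently of the searchable PDF question.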

Vikram Singh 1,070 Reputation points Microsoft Employee

2025-01-31T12:51:12.24+00:00

Hi Andrew Richardson (W),

Thank you for reaching out to Microsoft Q&A, and apologies for the inconvenience.

It seems you're using the Azure Document Intelligence API to batch OCR PDFs, but the output isn't in the expected searchable PDF format. To resolve this, please ensure you're using the begin_analyze_batch_documents method with the correct input and output settings. Specifically, set the contentType to application/pdf and the output parameter to ["pdf"]. Additionally, ensure the output_content_format is set to "pdf" to generate a searchable PDF.

Reference: Double-check the Azure documentation for the exact syntax to return PDF output: Azure Document Intelligence API.

Let me know if you're still facing issues.
Andrew Richardson (W) 0 Reputation points

2025-01-31T13:52:49.8233333+00:00
Hi Vikram,

Thanks for this info. I tried as you recommended, but I seem to be getting an error: "azure.core.exceptions.HttpResponseError: (UnsupportedMediaType) Request content type is not supported.

Code: UnsupportedMediaType"

This is the code I used. I looked at the documentation but didn't find anything to resolve this further:

    poller = document_intelligence_client.begin_analyze_batch_documents(
        model_id="prebuilt-read",
        body=request,
        output=["pdf"],
        content_type="application/pdf",
        output_content_format="pdf",
    )
Vikram Singh 1,070 Reputation points Microsoft Employee

2025-02-01T05:35:42.1066667+00:00
Hi Andrew Richardson (W),

Thanks for trying the suggestion. The error (UnsupportedMediaType) Request content type is not supported indicates an issue with the content_type parameter. The begin_analyze_batch_documents method does not require content_type="application/pdf". Instead, ensure the request body is correctly formatted and output_content_format is set to "pdf".

Can you try the below code:

    # Define the request body
    request = {
        "azureBlobSource": {
            "containerUrl": "https://<your-storage-account>.blob.core.windows.net/<your-container>"
        },
        "outputContentFormat": "pdf",
    }

    # Start the batch analysis
    poller = document_intelligence_client.begin_analyze_batch_documents(
        model_id="prebuilt-read",
        body=request,
        output=["pdf"],
    )

Let me know if the issue persists!
Vikram Singh 1,070 Reputation points Microsoft Employee

2025-02-04T04:21:38.4633333+00:00

Hi Andrew Richardson (W),

Greetings.

Just following up to check if my suggestion helped. Please do not forget to "Accept the answer" and "up-vote" wherever the information provided helps you; this can be beneficial to other community members.

Thank you
Andrew Richardson (W) 0 Reputation points

2025-02-04T16:48:01.4466667+00:00
Hi Vikram, I changed the body to suit the format you suggested, but it still only provided JSON output. I am following this code sample for batching documents: https://github.com/Azure/azure-sdk-for-python/blob/azure-ai-documentintelligence_1.0.0/sdk/documentintelligence/azure-ai-documentintelligence/samples/sample_analyze_batch_documents.py

This example uses a different type of request body from the one you suggested, utilising "AnalyzeBatchDocumentsRequest":

    request = AnalyzeBatchDocumentsRequest(
        result_container_url=result_container_sas_url,
        azure_blob_source=AzureBlobContentSource(
            container_url=batch_training_data_container_sas_url,
        ),
    )
Vikram Singh 1,070 Reputation points Microsoft Employee

2025-02-05T07:33:41.6766667+00:00
Hello Andrew Richardson (W),

Thanks for sharing the code sample. Since you're following the official sample and using AnalyzeBatchDocumentsRequest, the JSON output suggests that the API isn't generating a searchable PDF. To resolve this, can you try the code below, which adds output_content_format="pdf" to the request so that the output is returned as a searchable PDF instead of JSON?

    request = AnalyzeBatchDocumentsRequest(
        result_container_url=result_container_sas_url,
        azure_blob_source=AzureBlobContentSource(
            container_url=batch_training_data_container_sas_url
        ),
        output_content_format="pdf",  # Ensure this is included
    )

Do let me know if you are still facing the issue.

Thanks.

Andrew Richardson (W) 0

Hi Vikram,

I tried this again and got the following error. Any further suggestions are welcome, but I'm starting to think this isn't possible.

Thanks

Andrew.

Traceback (most recent call last):
  File "c:\Github Repo\Azure-Document-Integellience-OCR-Scripts\DocumentAIBatch.py", line 61, in <module>
    analyze_batch_docs()
  File "c:\Github Repo\Azure-Document-Integellience-OCR-Scripts\DocumentAIBatch.py", line 17, in analyze_batch_docs
    request = AnalyzeBatchDocumentsRequest(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Github Repo\Azure-Document-Integellience-OCR-Scripts\.venv\Lib\site-packages\azure\ai\documentintelligence\models\_models.py", line 175, in __init__
    super().__init__(*args, **kwargs)
  File "C:\Github Repo\Azure-Document-Integellience-OCR-Scripts\.venv\Lib\site-packages\azure\ai\documentintelligence\_model_base.py", line 564, in __init__
    raise TypeError(f"{class_name}.__init__() got an unexpected keyword argument '{non_attr_kwargs[0]}'")
TypeError: AnalyzeBatchDocumentsRequest.__init__() got an unexpected keyword argument 'output_content_format'
(base) PS C:\Github Repo\Azure-Document-Integellience-OCR-Scripts>

1 answer

Vikram Singh 1,070 Reputation points Microsoft Employee

2025-02-07T04:27:46.95+00:00
Hello Andrew Richardson (W),

Thanks for troubleshooting this; I understand your frustration.

Based on the Azure SDK documentation for Document Intelligence, the AnalyzeBatchDocumentsRequest class does not support the output_content_format parameter. This is confirmed by the following details:

The begin_analyze_batch_documents method in the DocumentIntelligenceClient class accepts an AnalyzeBatchDocumentsRequest object, but the output_content_format parameter is not listed as a valid argument.

The output_content_format parameter is supported in the begin_analyze_document method, which suggests that you should use this method instead for specifying the output format.

Here is an example of how to use the begin_analyze_document method with the output_content_format parameter:

    from azure.ai.documentintelligence import DocumentIntelligenceClient
    from azure.core.credentials import AzureKeyCredential

    # Define the request
    request = {
        "result_container_url": result_container_sas_url,
        "azure_blob_source": {
            "container_url": batch_training_data_container_sas_url
        }
    }

    # Make the request
    poller = client.begin_analyze_document(
        model_id="<your-model-id>",
        analyze_request=request,
        output_content_format="pdf"
    )

    # Get the result
    result = poller.result()

    # Save the result as a PDF
    with open("output.pdf", "wb") as f:
        f.write(result)

For more detailed information, you can refer to the official Microsoft documentation on the DocumentIntelligenceClient class and its methods.

I hope this helps! Let me know if you have any other questions.

If the reply was helpful, please don't forget to upvote and/or accept as answer, this can be beneficial to other community members.

Thanks
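For readers landing on this thread, here is a hedged sketch of a per-document alternative, written from recollection of the azure-ai-documentintelligence 1.0.0 package rather than from anything confirmed above. It assumes that a searchable PDF is requested through the output option of begin_analyze_document (AnalyzeOutputOption.PDF), that the PDF is then fetched with get_analyze_result_pdf using the operation id exposed on the poller, and that output_content_format only chooses between text and markdown for the extracted content. The helper names searchable_pdf_name and make_searchable_pdf and the output naming scheme are hypothetical; please verify the SDK calls against the official reference before relying on them.

```python
import os


def searchable_pdf_name(blob_name):
    """Map 'scans/report.pdf' to 'scans/report.searchable.pdf' (illustrative naming)."""
    stem, ext = os.path.splitext(blob_name)
    return f"{stem}.searchable{ext or '.pdf'}"


def make_searchable_pdf(endpoint, key, source_url, output_path):
    """Analyze one document by URL and save the searchable PDF locally."""
    # Imported lazily so the naming helper above works without the SDK installed.
    from azure.ai.documentintelligence import DocumentIntelligenceClient
    from azure.ai.documentintelligence.models import (
        AnalyzeDocumentRequest,
        AnalyzeOutputOption,
    )
    from azure.core.credentials import AzureKeyCredential

    client = DocumentIntelligenceClient(
        endpoint=endpoint, credential=AzureKeyCredential(key)
    )
    poller = client.begin_analyze_document(
        "prebuilt-read",
        body=AnalyzeDocumentRequest(url_source=source_url),
        output=[AnalyzeOutputOption.PDF],  # assumption: requests the PDF output
    )
    result = poller.result()
    # Assumption: the poller exposes the operation id through its details mapping.
    operation_id = poller.details["operation_id"]
    pdf_chunks = client.get_analyze_result_pdf(
        model_id=result.model_id, result_id=operation_id
    )
    with open(output_path, "wb") as f:
        f.writelines(pdf_chunks)
```

To cover a whole container, one option is to loop over the blob SAS URLs and call make_searchable_pdf(endpoint, key, url, searchable_pdf_name(name)) for each. This trades the single batch call for one request per document.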