How to read text from .doc and .docx files in Azure Blob Storage with Python?

11-4688 211 Reputation points
2024-10-28T16:47:05.13+00:00

Hello,

Is there a way to read the content of .docx files—and more importantly, .doc files—stored in Azure Blob Storage directly in Python without having to download them locally? Handling .doc files locally can be quite cumbersome, so I'm curious how Azure Blob Storage addresses this challenge.

Thank you!


2 answers

  1. Sumarigo-MSFT 47,371 Reputation points Microsoft Employee
    2024-11-03T06:24:15.52+00:00

    @Adam Kupiec You can read .docx files stored in Azure Blob Storage in Python without saving them to local disk. The Azure Storage Blob SDK for Python downloads the blob's bytes into memory, where python-docx can parse them. Legacy .doc files can be fetched the same way, but python-docx cannot parse them, so they need a conversion step first.

    Here's a general approach to achieve this:

    Install the required packages: You need the Azure Storage Blob SDK for Python and, because the example below imports docx, the python-docx package. You can install both using pip:

    pip install azure-storage-blob python-docx
    

    Authenticate and Access the Blob Storage: Use the SDK to authenticate and access the blob storage container where your files are stored. Here's an example code snippet to read the content of a .docx file:

    from azure.storage.blob import BlobServiceClient
    from io import BytesIO
    import docx
    
    # Replace with your connection string and container name
    connection_string = "your_connection_string"
    container_name = "your_container_name"
    blob_name = "your_file.docx"
    
    # Create a BlobServiceClient
    blob_service_client = BlobServiceClient.from_connection_string(connection_string)
    container_client = blob_service_client.get_container_client(container_name)
    blob_client = container_client.get_blob_client(blob_name)
    
    # Read the blob content into memory
    stream = BytesIO()
    blob_client.download_blob().readinto(stream)
    stream.seek(0)
    
    # Use python-docx to read the content of the .docx file
    doc = docx.Document(stream)
    for paragraph in doc.paragraphs:
        print(paragraph.text)
    

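    Handling .doc Files: python-docx only reads .docx files. The legacy binary .doc format has to be converted to .docx (or to plain text) before it can be read. Options include the pywin32 library, which automates Microsoft Word and therefore requires Windows with Word installed, or a headless LibreOffice installation. Note that pypandoc wraps pandoc, which to my knowledge reads .docx but not legacy .doc, so verify that it accepts your files before relying on it. Most of these converters work on file paths, so a temporary file is usually needed for this step.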
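As a rough illustration of the conversion step, here is a minimal sketch that assumes LibreOffice is installed and that its `soffice` executable is on the PATH. The helper names `build_soffice_command` and `convert_doc_bytes_to_docx` are hypothetical, and the conversion does write to a temporary directory, which is deleted afterwards.

```python
import subprocess
import tempfile
from pathlib import Path


def build_soffice_command(input_path, out_dir, soffice="soffice"):
    """Return the argument list for a headless LibreOffice .doc -> .docx conversion."""
    return [
        soffice,  # assumption: the LibreOffice executable is on PATH under this name
        "--headless",
        "--convert-to", "docx",
        "--outdir", str(out_dir),
        str(input_path),
    ]


def convert_doc_bytes_to_docx(doc_bytes, soffice="soffice", timeout=120):
    """Convert legacy .doc bytes to .docx bytes via a throwaway temp directory."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "input.doc"
        src.write_bytes(doc_bytes)
        # Raises CalledProcessError if LibreOffice exits with a non-zero status.
        subprocess.run(
            build_soffice_command(src, tmp, soffice),
            check=True,
            capture_output=True,
            timeout=timeout,
        )
        # LibreOffice writes <input stem>.docx into the --outdir directory.
        return (Path(tmp) / "input.docx").read_bytes()
```

The returned bytes could then be wrapped in `BytesIO` and passed to `docx.Document`, in the same way as the .docx example above.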

    For .docx files, this approach reads the content from Azure Blob Storage directly into memory, so nothing needs to be saved to local disk.
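If installing python-docx is not an option, the same in-memory bytes can be inspected with the standard library alone, because a .docx file is a zip archive whose main text lives in `word/document.xml`. The following is a simplified sketch, not a replacement for python-docx: the function name `docx_paragraph_texts` is hypothetical, and it ignores headers, footers, footnotes, and formatting.

```python
import zipfile
import xml.etree.ElementTree as ET
from io import BytesIO

# WordprocessingML namespace used by elements such as <w:p> and <w:t>.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraph_texts(docx_bytes):
    """Return the text of each <w:p> paragraph in the main document part."""
    with zipfile.ZipFile(BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    texts = []
    # iter() walks the whole tree, so paragraphs inside tables are included too.
    for paragraph in root.iter(W_NS + "p"):
        runs = [node.text or "" for node in paragraph.iter(W_NS + "t")]
        texts.append("".join(runs))
    return texts
```

The input would be the raw blob content, for example the bytes returned by `blob_client.download_blob().readall()`.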


    Please let us know if you have any further queries. I’m happy to assist you further.    



    1 person found this answer helpful.

  2. Keshavulu Dasari 2,420 Reputation points Microsoft Vendor
    2024-12-06T04:17:05.3633333+00:00

    Hi swarmttied,
    Apologies for the delayed response. The "file is not a valid zip file" error typically occurs because .docx files are zip archives containing XML and other resources. If the bytes passed to python-docx are not a complete, valid zip archive (for example, the file is corrupted, was not saved properly, or is really a legacy .doc file), this error can arise.
    If you open the file from disk, make sure to open it in binary mode; this is crucial for reading .docx files correctly. Ensure that you are using the latest versions of python-docx and other related libraries, since bugs are sometimes fixed in newer releases.
    If the issue persists, you can use the zipfile module to check the contents of the .docx file manually.
    Ensure that you are using BytesIO correctly when reading from Azure Blob Storage, in particular that the stream is rewound with seek(0) before it is passed to python-docx. If the problem still persists, it might be worth checking which version of Word produced the file and whether there are compatibility issues with python-docx.
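The manual check with zipfile could look something like the sketch below. It uses only the standard library and classifies the downloaded bytes by their leading signature before python-docx ever sees them. The function name `sniff_word_format` and its return labels are hypothetical; note also that the OLE2 signature is shared by other legacy Office formats (.xls, .ppt), so "ole2" only suggests a legacy .doc file.

```python
import zipfile
from io import BytesIO

ZIP_MAGIC = b"PK\x03\x04"                         # local file header of a zip archive
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # OLE2 compound file (legacy Office)


def sniff_word_format(data):
    """Classify raw bytes as 'docx', 'ole2', 'zip-not-docx', 'empty', or 'unknown'."""
    if not data:
        return "empty"  # e.g. the stream was never filled or the blob is empty
    if data.startswith(OLE2_MAGIC):
        return "ole2"   # likely a legacy .doc, possibly renamed to .docx
    if data.startswith(ZIP_MAGIC) and zipfile.is_zipfile(BytesIO(data)):
        with zipfile.ZipFile(BytesIO(data)) as archive:
            if "word/document.xml" in archive.namelist():
                return "docx"
        return "zip-not-docx"
    return "unknown"
```

If this returns "ole2" for a file named .docx, the file is probably a renamed .doc and needs converting rather than re-downloading.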


    If you are still facing issues, please provide some screenshots and details in the "comments" and I would be happy to help you. Thank you again for your time and patience throughout this issue.
