How to read text from .doc and .docx files in Azure Blob Storage with Python?

11-4688 206 Reputation points
2024-10-28T16:47:05.13+00:00

Hello,

Is there a way to read the content of .docx files—and more importantly, .doc files—stored in Azure Blob Storage directly in Python without having to download them locally? Handling .doc files locally can be quite cumbersome, so I'm curious how Azure Blob Storage addresses this challenge.

Thank you!


1 answer

  1. Sumarigo-MSFT 47,106 Reputation points Microsoft Employee
    2024-11-03T06:24:15.52+00:00

    @Adam Kupiec Yes, you can read .docx files stored in Azure Blob Storage in Python without saving them to local disk. The Azure Storage Blob SDK for Python downloads the blob's bytes into memory, where a parser can read them. Legacy .doc files need an extra conversion step, covered below.

    Here's a general approach to achieve this:

    Install the packages: You need the Azure Storage Blob SDK for Python, plus python-docx, which the example below uses to parse the document. You can install both using pip:

    pip install azure-storage-blob python-docx
    

    Authenticate and Access the Blob Storage: Use the SDK to authenticate and access the blob storage container where your files are stored. Here's an example code snippet to read the content of a .docx file:

    from azure.storage.blob import BlobServiceClient
    from io import BytesIO
    import docx
    
    # Replace with your connection string and container name
    connection_string = "your_connection_string"
    container_name = "your_container_name"
    blob_name = "your_file.docx"
    
    # Create a BlobServiceClient
    blob_service_client = BlobServiceClient.from_connection_string(connection_string)
    container_client = blob_service_client.get_container_client(container_name)
    blob_client = container_client.get_blob_client(blob_name)
    
    # Read the blob content into memory
    stream = BytesIO()
    blob_client.download_blob().readinto(stream)
    stream.seek(0)
    
    # Use python-docx to read the content of the .docx file
    doc = docx.Document(stream)
    for paragraph in doc.paragraphs:
        print(paragraph.text)
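
    If installing python-docx is not an option, the same in-memory bytes can be parsed with the standard library alone, because a .docx file is a ZIP archive whose main body lives in word/document.xml. The following is a minimal sketch, not a replacement for python-docx: the function name docx_bytes_to_paragraphs is hypothetical, and it ignores tabs, line breaks, headers, and footers.

```python
import zipfile
import xml.etree.ElementTree as ET
from io import BytesIO

# WordprocessingML namespace used by the elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_bytes_to_paragraphs(data: bytes) -> list[str]:
    """Return the text of every <w:p> paragraph in a .docx given as bytes.

    Hypothetical helper for illustration. Unlike python-docx's
    Document.paragraphs, this also yields paragraphs inside tables.
    """
    with zipfile.ZipFile(BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for paragraph in root.iter(f"{W_NS}p"):
        # Paragraph text is split across runs; join every <w:t> node.
        text = "".join(node.text or "" for node in paragraph.iter(f"{W_NS}t"))
        paragraphs.append(text)
    return paragraphs
```

    With the blob_client from the snippet above, this could be called as docx_bytes_to_paragraphs(blob_client.download_blob().readall()).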
    

    Handling .doc Files: python-docx reads only .docx, so a legacy .doc file has to be converted to .docx first. One option is to automate Word through the pywin32 library, which requires Windows with Word installed. Check format support before relying on pypandoc: pandoc, which it wraps, lists docx but not the binary .doc format among its input formats.
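
    One way to sketch the conversion step, under assumptions that go beyond the answer above: LibreOffice is installed and its soffice binary is on PATH, and a short-lived temporary directory is acceptable. The function names sniff_word_format and convert_doc_to_docx are hypothetical. The first checks the leading bytes, since a blob's extension may not match its real format; the second shells out to headless LibreOffice.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

# Leading bytes of the two containers. Both are shared with other Office
# formats (.xls/.ppt and .xlsx/.pptx), so this is a guess, not a proof.
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # legacy .doc (OLE compound file)
ZIP_MAGIC = b"PK\x03\x04"  # .docx (ZIP archive)


def sniff_word_format(data: bytes) -> str:
    """Guess 'doc', 'docx', or 'unknown' from the first bytes of a blob."""
    if data.startswith(OLE2_MAGIC):
        return "doc"
    if data.startswith(ZIP_MAGIC):
        return "docx"
    return "unknown"


def convert_doc_to_docx(doc_bytes: bytes, soffice: str = "soffice") -> bytes:
    """Convert .doc bytes to .docx bytes with headless LibreOffice.

    Assumption: the LibreOffice binary named by `soffice` is on PATH.
    """
    if shutil.which(soffice) is None:
        raise RuntimeError(f"{soffice!r} not found on PATH")

    # LibreOffice works on files, so use a temp dir that is deleted on exit.
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "input.doc"
        src.write_bytes(doc_bytes)
        subprocess.run(
            [soffice, "--headless", "--convert-to", "docx", "--outdir", tmp, str(src)],
            check=True,
            capture_output=True,
            timeout=120,
        )
        return (Path(tmp) / "input.docx").read_bytes()
```

    The returned bytes can be wrapped in BytesIO and passed to docx.Document exactly like the .docx stream in the earlier example. The document does touch disk briefly, but only inside the temporary directory.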

    This approach reads the file content from Azure Blob Storage straight into memory, so nothing is written to the local disk.

    Please let us know if you have any further queries. I’m happy to assist you further.    


    Please do not forget to "Accept the answer" and "up-vote" wherever the information provided helps you; this can be beneficial to other community members.

