No support for reading zip files stored in Azure Data Lake Storage using an abfss protocol URL via a Python notebook.

Jeevesh 0 Reputation points
2025-03-03T11:08:21.7866667+00:00

Hi,

It is strange that Microsoft created the abfss protocol but provides no support for reading zip files over abfss from a Python notebook.

For example, this sample code gives me a hard time identifying a zip file as a zip file:

import zipfile
from notebookutils import mssparkutils

rawDirectory = "myRawDir"
azureBlobUpload = "abfss://myuploadcontainer@


2 answers

  1. Amira Bedhiafi 29,481 Reputation points
    2025-03-03T13:51:24.4533333+00:00

    Verify that your Azure Databricks workspace has the necessary permissions to access the Azure Data Lake Storage Gen2 account. You can use a service principal or managed identity for authentication.

    Then mount the storage:

    
    configs = {
      "fs.azure.account.auth.type": "OAuth",
      "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
      "fs.azure.account.oauth2.client.id": "<client-id>",
      "fs.azure.account.oauth2.client.secret": "<client-secret>",
      "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token"
    }

    dbutils.fs.mount(
      source="abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
      mount_point="/mnt/<mount-name>",
      extra_configs=configs)
    

    Since the zipfile module cannot open an abfss:// path directly, copy the ZIP file to the local file system first and then process it there.

    
    zip_file_path = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-zip-file>"
    local_zip_path = "/dbfs/tmp/<zip-file-name>.zip"

    # /dbfs is the local FUSE mount of DBFS, so "file:" + local_zip_path lands under dbfs:/tmp
    dbutils.fs.cp(zip_file_path, "file:" + local_zip_path)
    

    Now that the ZIP file is on the local file system, you can use the zipfile module to list, extract, and read its contents.

    
    import zipfile

    with zipfile.ZipFile(local_zip_path, 'r') as zip_ref:
        # List the archive contents
        file_list = zip_ref.namelist()
        print("Files in ZIP archive:", file_list)

        # Extract everything to a local directory
        zip_ref.extractall("/dbfs/tmp/extracted_files/")

        # Read the first file directly from the archive
        with zip_ref.open(file_list[0]) as file:
            content = file.read()
            print("Content of the first file:", content)
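
    If you would rather skip the local copy entirely, note that `zipfile.ZipFile` accepts any seekable file-like object, not just a path, so the archive's raw bytes can be wrapped in `io.BytesIO` and opened in memory. A minimal sketch — how you fetch the bytes from storage (`dbutils.fs`, an Azure SDK client, etc.) is left out, and the demo builds the bytes in memory as a stand-in:

```python
import io
import zipfile

def open_zip_from_bytes(raw: bytes) -> zipfile.ZipFile:
    # zipfile only needs a seekable file-like object, not a file path,
    # so bytes fetched from cloud storage can be wrapped in BytesIO.
    return zipfile.ZipFile(io.BytesIO(raw), "r")

# Demo: build an in-memory archive standing in for bytes downloaded from ADLS.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("hello.txt", "hello from the archive")
raw = buf.getvalue()  # in a notebook this would be the downloaded blob bytes

with open_zip_from_bytes(raw) as zf:
    print(zf.namelist())                  # ['hello.txt']
    print(zf.read("hello.txt").decode())  # hello from the archive
```

    This approach is best suited to archives that comfortably fit in driver memory.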
    

  2. J N S S Kasyap 685 Reputation points Microsoft External Staff
    2025-03-04T22:48:22.9666667+00:00

    @Jeevesh

    I appreciate you sharing your concerns and taking the time to reach out. It can be frustrating when expected functionalities don’t work seamlessly, especially in an ecosystem designed to integrate multiple services.
    The challenge here is that Python’s zipfile module does not natively support the abfss:// protocol, as it is designed to work with local file paths or file-like objects. Since abfss:// is a cloud-based storage protocol, direct access is not possible without an intermediary step.
    You can still work with ZIP files in Azure Data Lake Storage by first copying them to the node's local file system with mssparkutils.fs.copy() and then processing them with zipfile.

    Here’s a way to achieve this in an Azure Synapse notebook:

    from notebookutils import mssparkutils
    import zipfile
    import os
    # Define the path to the zip file in ADLS
    zip_path = "abfss://myuploadcontainer@<storage_account>.dfs.core.windows.net/myzipfile.zip"
    # Define a temporary local path to copy the zip file
    local_path = "/tmp/myzipfile.zip"
    # Ensure the temp directory exists
    os.makedirs("/tmp", exist_ok=True)
    # Copy the zip file from ADLS to local storage
    mssparkutils.fs.copy(zip_path, f"file://{local_path}", True)
    # Open the copied zip file using zipfile
    with zipfile.ZipFile(local_path, 'r') as zip_ref:
        # Extract all contents to a local directory
        zip_ref.extractall("/tmp/extracted")
        # List the contents of the zip file
        print(zip_ref.namelist())
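
    Since the original question was about a file not being recognized as a zip, it can also help to validate the copied file before extracting: `zipfile.is_zipfile()` checks the archive's magic bytes, which catches cases where the download produced a truncated file or an HTML error page saved under a `.zip` name. A small sketch — the `safe_extract` helper name and paths are illustrative, not part of any library:

```python
import zipfile

def safe_extract(local_path: str, dest: str) -> list:
    # is_zipfile() inspects the ZIP magic bytes, so a bad download
    # (e.g. an HTML error page with a .zip name) is caught early.
    if not zipfile.is_zipfile(local_path):
        raise ValueError(f"{local_path} is not a valid ZIP archive")
    with zipfile.ZipFile(local_path, "r") as zf:
        zf.extractall(dest)
        return zf.namelist()
```

    Calling this in place of a bare `extractall` turns a confusing "not a zip file" failure into an explicit error naming the offending file.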
    

    Please refer to the threads below:
    https://learn.microsoft.com/en-us/answers/questions/2116923/how-to-read-zip-file-in-azure-synapse-notebook-wit
    Introduction to Microsoft Spark utilities - Azure Synapse Analytics | Microsoft Learn
    Hope this helps. Do let us know if you have any further queries.

