Verify that your Azure Databricks workspace has the necessary permissions to access the Azure Data Lake Storage Gen2 account. You can use a service principal or managed identity for authentication.
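Rather than hard-coding credentials, you can keep the service principal's client ID and secret in a Databricks secret scope. A minimal sketch, assuming a scope and key names of your choosing (the scope and key names below are placeholders):

# Sketch: retrieve service principal credentials from a Databricks secret scope
# instead of pasting them into the notebook. Replace "<scope-name>",
# "<client-id-key>", and "<client-secret-key>" with your own names.
client_id = dbutils.secrets.get(scope="<scope-name>", key="<client-id-key>")
client_secret = dbutils.secrets.get(scope="<scope-name>", key="<client-secret-key>")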
Then mount the storage:
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<client-id>",
    "fs.azure.account.oauth2.client.secret": "<client-secret>",
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token"
}

dbutils.fs.mount(
    source="abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
    mount_point="/mnt/<mount-name>",
    extra_configs=configs
)
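To confirm the mount worked, list its contents (same mount name as above):

# Quick check: list the mounted container's contents.
display(dbutils.fs.ls("/mnt/<mount-name>"))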
Since Python's zipfile module cannot read a ZIP file directly from the mounted abfss location, copy it to the local file system first and process it from there.
# Copy the ZIP from the mounted container to the driver's local file system.
# The /dbfs prefix is the local FUSE view of DBFS, so "file:" + local_zip_path
# tells dbutils to write to that local path.
zip_file_path = "/mnt/<mount-name>/<path-to-zip-file>"
local_zip_path = "/dbfs/tmp/<zip-file-name>.zip"
dbutils.fs.cp(zip_file_path, "file:" + local_zip_path)
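As a sanity check, you can verify the copy is visible to local Python before unzipping (standard library only):

import os

# The /dbfs FUSE mount exposes DBFS paths to local Python.
print("Local copy exists:", os.path.exists(local_zip_path))
print("Size in bytes:", os.path.getsize(local_zip_path))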
Now that the ZIP file is on the local file system, you can use Python's zipfile module to list, extract, and read its contents.
import zipfile

with zipfile.ZipFile(local_zip_path, 'r') as zip_ref:
    # List every entry in the archive.
    file_list = zip_ref.namelist()
    print("Files in ZIP archive:", file_list)

    # Extract everything to DBFS via the local FUSE path.
    zip_ref.extractall("/dbfs/tmp/extracted_files/")

    # Read the first entry directly from the archive (returns bytes).
    with zip_ref.open(file_list[0]) as file:
        content = file.read()
        print("Content of the first file:", content)