Mounting cloud object storage on Azure Databricks

Artikkeli
08/29/2024

Important

Mounts are a legacy access pattern. Databricks recommends using Unity Catalog for managing all data access. See Connect to cloud object storage and services using Unity Catalog.

Azure Databricks enables users to mount cloud object storage to the Databricks File System (DBFS) to simplify data access patterns for users that are unfamiliar with cloud concepts. Mounted data does not work with Unity Catalog, and Databricks recommends migrating away from using mounts and instead managing data governance with Unity Catalog.

How does Azure Databricks mount cloud object storage?

Azure Databricks mounts create a link between a workspace and cloud object storage, which enables you to interact with cloud object storage using familiar file paths relative to the Databricks file system. Mounts work by creating a local alias under the /mnt directory that stores the following information:

Location of the cloud object storage.
Driver specifications to connect to the storage account or container.
Security credentials required to access the data.

What is the syntax for mounting storage?

The source specifies the URI of the object storage (and can optionally encode security credentials). The mount_point specifies the local path in the /mnt directory. Some object storage sources support an optional encryption_type argument. For some access patterns you can pass additional configuration specifications as a dictionary to extra_configs.

Note

Databricks recommends setting mount-specific Spark and Hadoop configuration as options using extra_configs. This ensures that configurations are tied to the mount rather than the cluster or session.

dbutils.fs.mount(
  source: str,
  mount_point: str,
  encryption_type: Optional[str] = "",
  extra_configs: Optional[dict[str:str]] = None
)

Check with your workspace and cloud administrators before configuring or altering data mounts, as improper configuration can provide unsecured access to all users in your workspace.

Note

In addition to the approaches described in this article, you can automate mounting a bucket with the Databricks Terraform provider and databricks_mount.

Unmount a mount point

To unmount a mount point, use the following command:

dbutils.fs.unmount("/mnt/<mount-name>")

Warning

To avoid errors, never modify a mount point while other jobs are reading or writing to it. After modifying a mount, always run dbutils.fs.refreshMounts() on all other running clusters to propagate any mount updates. See refreshMounts command (dbutils.fs.refreshMounts).

Mount ADLS Gen2 or Blob Storage with ABFS

You can mount data in an Azure storage account using a Microsoft Entra ID application service principal for authentication. For more information, see Access storage using a service principal & Microsoft Entra ID(Azure Active Directory).

Important

All users in the Azure Databricks workspace have access to the mounted ADLS Gen2 account. The service principal you use to access the ADLS Gen2 account should be granted access only to that ADLS Gen2 account; it should not be granted access to other Azure resources.
When you create a mount point through a cluster, cluster users can immediately access the mount point. To use the mount point in another running cluster, you must run dbutils.fs.refreshMounts() on that running cluster to make the newly created mount point available for use.
Unmounting a mount point while jobs are running can lead to errors. Ensure that production jobs do not unmount storage as part of processing.
Mount points that use secrets are not automatically refreshed. If mounted storage relies on a secret that is rotated, expires, or is deleted, errors can occur, such as 401 Unauthorized. To resolve such an error, you must unmount and remount the storage.
Hierarchical namespace (HNS) must be enabled to successfully mount an Azure Data Lake Storage Gen2 storage account using the ABFS endpoint.

Run the following in your notebook to authenticate and create a mount point.

configs = {"fs.azure.account.auth.type": "OAuth",
          "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
          "fs.azure.account.oauth2.client.id": "<application-id>",
          "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-name>",key="<service-credential-key-name>"),
          "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<directory-id>/oauth2/token"}

# Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
  source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
  mount_point = "/mnt/<mount-name>",
  extra_configs = configs)

val configs = Map(
  "fs.azure.account.auth.type" -> "OAuth",
  "fs.azure.account.oauth.provider.type" -> "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
  "fs.azure.account.oauth2.client.id" -> "<application-id>",
  "fs.azure.account.oauth2.client.secret" -> dbutils.secrets.get(scope="<scope-name>",key="<service-credential-key-name>"),
  "fs.azure.account.oauth2.client.endpoint" -> "https://login.microsoftonline.com/<directory-id>/oauth2/token")
// Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
  source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
  mountPoint = "/mnt/<mount-name>",
  extraConfigs = configs)

Replace

<application-id> with the Application (client) ID for the Azure Active Directory application.
<scope-name> with the Databricks secret scope name.
<service-credential-key-name> with the name of the key containing the client secret.
<directory-id> with the Directory (tenant) ID for the Azure Active Directory application.
<container-name> with the name of a container in the ADLS Gen2 storage account.
<storage-account-name> with the ADLS Gen2 storage account name.
<mount-name> with the name of the intended mount point in DBFS.

Jaa

Mounting cloud object storage on Azure Databricks

How does Azure Databricks mount cloud object storage?

What is the syntax for mounting storage?

Unmount a mount point

Mount ADLS Gen2 or Blob Storage with ABFS

Palaute

Lisäresursseja