Connect to Azure Data Lake Storage Gen2 and Blob Storage

Note

This article describes legacy patterns for configuring access to Azure Data Lake Storage Gen2. Databricks recommends using Unity Catalog to configure access to Azure Data Lake Storage Gen2 and volumes for direct interaction with files. See Connect to cloud object storage and services using Unity Catalog.

This article explains how to connect to Azure Data Lake Storage Gen2 and Blob Storage from Azure Databricks.

Connect to Azure Data Lake Storage Gen2 or Blob Storage using Azure credentials

The following credentials can be used to access Azure Data Lake Storage Gen2 or Blob Storage:

  • A Microsoft Entra ID service principal with OAuth 2.0
  • Shared access signature (SAS) tokens
  • Storage account access keys

Databricks recommends using secret scopes for storing all credentials. You can grant users, service principals, and groups in your workspace access to read the secret scope. This protects the Azure credentials while allowing users to access Azure storage. To create a secret scope, see Manage secret scopes.

Set Spark properties to configure Azure credentials to access Azure storage

You can set Spark properties to configure Azure credentials to access Azure storage. The credentials can be scoped to either a cluster or a notebook. Use both cluster access control and notebook access control together to protect access to Azure storage. See Compute permissions and Collaborate using Databricks notebooks.

Note

Microsoft Entra ID service principals can also be used to access Azure storage from a SQL warehouse. See Data access configurations.

To set Spark properties, use the following snippet in a cluster’s Spark configuration or a notebook:

Azure service principal

Use the following format to set the cluster Spark configuration:

spark.hadoop.fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net OAuth
spark.hadoop.fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
spark.hadoop.fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net <application-id>
spark.hadoop.fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net {{secrets/<secret-scope>/<service-credential-key>}}
spark.hadoop.fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net https://login.microsoftonline.com/<directory-id>/oauth2/token

You can use spark.conf.set in notebooks, as shown in the following example:

service_credential = dbutils.secrets.get(scope="<secret-scope>",key="<service-credential-key>")

spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net", "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net", service_credential)
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net", "https://login.microsoftonline.com/<directory-id>/oauth2/token")

Replace

  • <secret-scope> with the Databricks secret scope name.
  • <service-credential-key> with the name of the key containing the client secret.
  • <storage-account> with the name of the Azure storage account.
  • <application-id> with the Application (client) ID for the Microsoft Entra ID application.
  • <directory-id> with the Directory (tenant) ID for the Microsoft Entra ID application.
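Because every property shares the same structure and is keyed by the storage account name, the notebook snippet above can also be expressed with a small helper. The following sketch is illustrative only (the helper name and loop are not part of the Databricks API; the `fs.azure.*` key names match the properties shown above, and `spark.conf.set` is only available inside a Databricks or Spark session):

```python
# Hypothetical helper: builds the OAuth property map shown above for a
# given storage account. Only the fs.azure.* key names come from the
# documented configuration; the helper itself is an illustration.
def oauth_conf(storage_account, application_id, directory_id, client_secret):
    suffix = f"{storage_account}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{suffix}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{suffix}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"fs.azure.account.oauth2.client.id.{suffix}": application_id,
        f"fs.azure.account.oauth2.client.secret.{suffix}": client_secret,
        f"fs.azure.account.oauth2.client.endpoint.{suffix}":
            f"https://login.microsoftonline.com/{directory_id}/oauth2/token",
    }

# In a notebook you would then apply each property:
# for key, value in oauth_conf(...).items():
#     spark.conf.set(key, value)
```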

SAS tokens

You can configure SAS tokens for multiple storage accounts in the same Spark session.

spark.conf.set("fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net", "SAS")
spark.conf.set("fs.azure.sas.token.provider.type.<storage-account>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
spark.conf.set("fs.azure.sas.fixed.token.<storage-account>.dfs.core.windows.net", dbutils.secrets.get(scope="<scope>", key="<sas-token-key>"))

Replace

  • <storage-account> with the Azure Storage account name.
  • <scope> with the Azure Databricks secret scope name.
  • <sas-token-key> with the name of the key containing the Azure storage SAS token.
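Because each property is keyed by the storage account name, configuring several accounts means repeating the same three settings per account. The following is a minimal sketch of that pattern (the helper and the accounts mapping are illustrative, not part of the Databricks API; `dbutils.secrets.get` and `spark.conf.set` are only available inside Databricks):

```python
# Illustrative only: map each storage account name to its SAS token,
# then emit the three SAS properties per account, as documented above.
def sas_conf(accounts):
    """accounts: dict of storage-account name -> SAS token string."""
    conf = {}
    for account, token in accounts.items():
        suffix = f"{account}.dfs.core.windows.net"
        conf[f"fs.azure.account.auth.type.{suffix}"] = "SAS"
        conf[f"fs.azure.sas.token.provider.type.{suffix}"] = (
            "org.apache.hadoop.fs.azurebfs.sas.FixedSASTokenProvider")
        conf[f"fs.azure.sas.fixed.token.{suffix}"] = token
    return conf

# Inside Databricks you would fetch each token from a secret scope:
# tokens = {a: dbutils.secrets.get(scope="<scope>", key=f"{a}-sas-token")
#           for a in ["account1", "account2"]}
# for key, value in sas_conf(tokens).items():
#     spark.conf.set(key, value)
```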

Account key

spark.conf.set(
    "fs.azure.account.key.<storage-account>.dfs.core.windows.net",
    dbutils.secrets.get(scope="<scope>", key="<storage-account-access-key>"))

Replace

  • <storage-account> with the Azure Storage account name.
  • <scope> with the Azure Databricks secret scope name.
  • <storage-account-access-key> with the name of the key containing the Azure storage account access key.

Access Azure storage

Once you have properly configured credentials to access your Azure storage container, you can interact with resources in the storage account using URIs. Databricks recommends using the abfss driver for greater security.

spark.read.load("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-data>")

dbutils.fs.ls("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-data>")

CREATE TABLE <database-name>.<table-name>;

COPY INTO <database-name>.<table-name>
FROM 'abfss://container@storageAccount.dfs.core.windows.net/path/to/folder'
FILEFORMAT = CSV
COPY_OPTIONS ('mergeSchema' = 'true');
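The URIs above all follow the pattern `abfss://<container>@<storage-account>.dfs.core.windows.net/<path>`. As a sketch, a small helper makes that structure explicit (the function and the example names are hypothetical; `spark.read.load` requires a Databricks or Spark session):

```python
def abfss_uri(container, storage_account, path=""):
    """Build an abfss:// URI for an ADLS Gen2 container (illustrative helper)."""
    base = f"abfss://{container}@{storage_account}.dfs.core.windows.net"
    return f"{base}/{path.lstrip('/')}" if path else base

# Example with hypothetical container/account names:
uri = abfss_uri("raw", "mystorage", "events/2024")
# uri == "abfss://raw@mystorage.dfs.core.windows.net/events/2024"
# spark.read.load(uri)  # only inside a Databricks/Spark session
```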

Example notebook

ADLS Gen2 OAuth 2.0 with Microsoft Entra ID (formerly Azure Active Directory) service principals notebook

Get notebook

Azure Data Lake Storage Gen2 known issues

If you try accessing a storage container created through the Azure portal, you might receive the following error:

StatusCode=404
StatusDescription=The specified filesystem does not exist.
ErrorCode=FilesystemNotFound
ErrorMessage=The specified filesystem does not exist.

When a hierarchical namespace is enabled, you don’t need to create containers through the Azure portal. If you see this issue, delete the Blob container through the Azure portal. After a few minutes, you can access the container. Alternatively, you can change your abfss URI to use a different container, as long as that container was not created through the Azure portal.

See Known issues with Azure Data Lake Storage Gen2 in the Microsoft documentation.

Deprecated patterns for storing and accessing data from Azure Databricks

The following are deprecated storage patterns: