Best practices for DBFS and Unity Catalog

Unity Catalog introduces a number of new configurations and concepts that approach data governance entirely differently than DBFS. This article outlines several best practices for working with Unity Catalog external locations and DBFS.

Databricks recommends against using DBFS and mounted cloud object storage for most use cases in Unity Catalog-enabled Azure Databricks workspaces. This article describes a few scenarios in which you should use mounted cloud object storage. Note that Databricks does not recommend using the DBFS root in conjunction with Unity Catalog, unless you must migrate files or data stored there into Unity Catalog.

How is DBFS used in Unity Catalog-enabled workspaces?

Actions performed against tables in the hive_metastore use legacy data access patterns, which might include data and storage credentials managed by DBFS. Managed tables in the workspace-scoped hive_metastore are stored on the DBFS root.
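
For example, a table in the legacy Hive metastore and a table governed by Unity Catalog are addressed through different namespaces and resolve through different access models. The table names below are hypothetical, for illustration only:

# Hypothetical table names for illustration.
legacy_df = spark.read.table("hive_metastore.default.sales")  # legacy access; managed tables live on the DBFS root
uc_df = spark.read.table("main.default.sales")                # access governed by Unity Catalog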

How does DBFS work in single user access mode?

Clusters configured with single user access mode have full access to DBFS, including all files in the DBFS root and mounted data.
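
As a rough sketch, on a single user access mode cluster both of the following succeed without any Unity Catalog grants ("/mnt/raw-data" is a hypothetical mount point):

# Full DBFS access on a single user cluster; no Unity Catalog grants are checked here.
display(dbutils.fs.ls("/"))              # files in the DBFS root
display(dbutils.fs.ls("/mnt/raw-data"))  # hypothetical mounted cloud object storage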

How does DBFS work in shared access mode?

Shared access mode combines Unity Catalog data governance with Azure Databricks legacy table ACLs. Access to data in the hive_metastore is only available to users who have been explicitly granted permissions.

To interact with files directly using DBFS, you must have the ANY FILE permission granted. Because ANY FILE allows users to bypass legacy table ACLs in the hive_metastore and access all data managed by DBFS, Databricks recommends caution when granting this privilege.
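
For illustration, an admin might grant the privilege like this (the user name is hypothetical); grant it sparingly for the reasons above:

# Grants direct file access through DBFS, bypassing legacy table ACLs.
spark.sql("GRANT SELECT ON ANY FILE TO `analyst@example.com`")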

Do not use DBFS with Unity Catalog external locations

Unity Catalog secures access to data in external locations by using full cloud URI paths to identify grants on managed object storage directories. DBFS mounts use an entirely different data access model that bypasses Unity Catalog entirely. Databricks recommends that you do not reuse cloud object storage volumes between DBFS mounts and UC external volumes, including when sharing data across workspaces or accounts.
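
The contrast between the two access models looks roughly like this (the storage account, container, and mount names are hypothetical):

# DBFS mount: access is resolved through the mount's stored credentials,
# bypassing Unity Catalog governance entirely.
df = spark.read.format("delta").load("/mnt/landing/events")

# Unity Catalog external location: access is checked against grants
# on the full cloud URI path.
df = spark.read.format("delta").load("abfss://landing@myaccount.dfs.core.windows.net/events")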

Secure your Unity Catalog-managed storage

Unity Catalog uses managed storage locations to store data files for managed tables and volumes.

Databricks recommends the following for managed storage locations:

  • Use new storage accounts or buckets.
  • Define a custom identity policy for Unity Catalog.
  • Restrict all access to Azure Databricks managed by Unity Catalog.
  • Restrict all access to identity access policies created for Unity Catalog.
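
As a minimal sketch, a catalog can be pinned to a dedicated, locked-down storage container at creation time (the catalog and storage names below are hypothetical):

# Managed tables and volumes in this catalog store their data files under
# the managed location rather than the DBFS root.
spark.sql("""
  CREATE CATALOG IF NOT EXISTS finance
  MANAGED LOCATION 'abfss://managed@myaccount.dfs.core.windows.net/finance'
""")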

Add existing data to external locations

It is possible to load existing storage accounts into Unity Catalog using external locations. For greatest security, Databricks recommends only loading storage accounts to external locations after revoking all other storage credentials and access patterns.
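
A sketch of registering an existing container as an external location, assuming a storage credential named my_credential already exists (all names here are hypothetical):

# Register the container with Unity Catalog; do this only after revoking
# other credentials and access patterns, per the recommendation above.
spark.sql("""
  CREATE EXTERNAL LOCATION IF NOT EXISTS legacy_landing
  URL 'abfss://landing@myaccount.dfs.core.windows.net/'
  WITH (STORAGE CREDENTIAL my_credential)
""")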

You should never load a storage account used as a DBFS root as an external location in Unity Catalog.

Cluster configurations are ignored by Unity Catalog filesystem access

Unity Catalog does not respect cluster configurations for filesystem settings. This means that Hadoop filesystem settings for configuring custom behavior with cloud object storage do not work when accessing data using Unity Catalog.
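
For example, a session-scoped setting like the following (the account name and secret scope are hypothetical) affects direct cloud storage access, but is ignored for reads and writes that go through Unity Catalog, which resolves access through storage credentials instead:

# Ignored when data is accessed through Unity Catalog; UC uses storage
# credentials, not Hadoop filesystem settings.
spark.conf.set(
    "fs.azure.account.key.myaccount.dfs.core.windows.net",
    dbutils.secrets.get(scope="my-scope", key="storage-key"),
)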

Limitation around multiple path access

While you can generally use Unity Catalog and DBFS together, paths that are equal or share a parent/child relationship cannot be referenced in the same command or notebook cell using different access methods.

For example, if an external table foo is defined in the hive_metastore at location a/b/c and an external location is defined in Unity Catalog on a/b/, the following code throws an error:

# Reads foo through the hive_metastore and writes to a path governed by a
# Unity Catalog external location in the same command, which fails.
spark.read.table("foo").filter("id IS NOT NULL").write.mode("overwrite").save("a/b/c")

The error does not occur if the logic is broken into two cells:

# Cell 1: read through the hive_metastore table reference.
df = spark.read.table("foo").filter("id IS NOT NULL")

# Cell 2: write to the path governed by the Unity Catalog external location.
df.write.mode("overwrite").save("a/b/c")