
Hive metastore federation: enable Unity Catalog to govern tables registered in a Hive metastore

Important

This feature is in Public Preview.

This article introduces Hive metastore federation, a feature that enables Unity Catalog to govern tables that are stored in a Hive metastore. You can federate an external Hive metastore or a legacy internal Azure Databricks Hive metastore.

Hive metastore federation can be used for the following use cases:

  • As a step in the migration path to Unity Catalog, enabling incremental migration without code adaptation, with some of your workloads continuing to use data registered in your Hive metastore while others are migrated.

    This use case is most suited for organizations that use a legacy internal Azure Databricks Hive metastore today, because federated internal Hive metastores allow both read and write workloads.

  • To provide a longer-term hybrid model for organizations that must maintain some data in a Hive metastore alongside their data that is registered in Unity Catalog.

    This use case is most suited for organizations that use an external Hive metastore, because federated catalogs for these Hive metastores are read-only.

Diagram that introduces Hive federation

Overview of Hive metastore federation

In Hive metastore federation, you create a connection from your Azure Databricks workspace to your Hive metastore, and Unity Catalog crawls the Hive metastore to populate a federated catalog. The federated catalog enables your organization to work with your Hive metastore tables in Unity Catalog, which provides centralized access controls, lineage, search, and more.

Federated Hive metastores that are external to your Azure Databricks workspace allow reads using Unity Catalog. Internal Hive metastores allow reads and writes, updating the Hive metastore metadata as well as the Unity Catalog metadata when you write.

When you query federated Hive metastore assets, Unity Catalog provides the governance layer, performing functions such as access control checks and auditing, while queries are executed using Hive metastore semantics. For example, if a user queries a table stored in Parquet format in a federated catalog, then:

  • Unity Catalog checks if the user has access to the table and infers lineage for the query.
  • The query itself runs against the underlying Hive metastore, leveraging the latest metadata and partition information stored there.

Diagram that shows the relationship between the HMS, Unity Catalog, and Databricks workloads in a Hive federation scenario

How does Hive metastore federation compare to using Unity Catalog external tables?

Unity Catalog can create external tables, which take data that already exists in an arbitrary cloud storage location and register it in Unity Catalog as a table. This section explores the differences between external tables and federated Hive metastore tables.

Both table types have the following properties:

  • Can be used to register an arbitrary location in cloud storage as a table.
  • Can apply Unity Catalog permissions and fine-grained access controls.
  • Can be viewed in lineage for queries that reference them.

Only federated tables have the following properties:

  • Are automatically discovered based on crawling a Hive metastore. As soon as tables are created in the Hive metastore, they are surfaced and available to query in the Unity Catalog federated catalog.
  • Allow tables to be defined with Hive semantics such as Hive SerDes and partitions.
  • Allow tables to have overlapping paths with other tables in federated catalogs.
  • Allow tables to be located in DBFS root locations.
  • Include views that are defined in Hive metastore.

In this way you can think of federated Hive metastore tables as offering backwards compatibility with Hive metastore, allowing workloads to use Hive-only semantics but with governance provided by Unity Catalog.

However, some Unity Catalog features are not available on federated tables, for example:

  • Features available only for Unity Catalog managed tables, such as predictive optimization.
  • Vector search, Delta Sharing, Lakehouse monitoring, and online tables.
  • Some feature store functionality, including feature store creation, model serving creation, feature spec creation, model logging and batch scoring.

Performance can be marginally worse than for workloads that run directly on Unity Catalog or on Hive metastore, because both Hive metastore and Unity Catalog are on the query path of a federated table.

For more information about supported functionality, see Requirements, supported features, and limitations.

What does it mean to write to a federated Hive metastore catalog in Azure Databricks?

Writes are supported only for federated internal Hive metastores, not external Hive metastores.

Writes to federated metastores are of two types:

  • DDL operations such as CREATE TABLE, ALTER TABLE, and DROP TABLE.

    DDL operations are synchronously reflected in the underlying Hive metastore. For example, running a CREATE TABLE statement creates the table in both the Hive metastore and the federated catalog.

    Warning

    This also means that DROP commands are reflected in the Hive metastore. For example, DROP SCHEMA mySchema CASCADE drops all tables in the underlying Hive metastore schema, without the option to UNDROP, because Hive metastore does not support UNDROP.

  • DML operations such as INSERT, UPDATE, and DELETE.

    DML operations are also synchronously reflected in the underlying Hive metastore table. For example, running INSERT INTO adds records to the table in the Hive metastore.

    Write support is key to enabling a seamless transition during migration from Hive metastore to Unity Catalog. See How do you use Hive metastore federation during migration to Unity Catalog?.
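As a rough illustration of the two write types, the following sketch runs DDL and DML statements against a federated catalog for an internal Hive metastore. It borrows hms_in_uc, the example catalog name used later in this article. The schema, table, and column names are hypothetical.

```sql
-- Assumes a federated catalog for an internal Hive metastore named hms_in_uc.
-- The schema (sales), table (orders_staging), and columns are hypothetical.

-- DDL: the table is created in both the Hive metastore and the federated catalog.
CREATE TABLE hms_in_uc.sales.orders_staging (
  order_id BIGINT,
  amount   DECIMAL(10, 2)
);

-- DML: the rows are added to the underlying Hive metastore table.
INSERT INTO hms_in_uc.sales.orders_staging VALUES (1, 19.99), (2, 5.00);

-- DDL: the drop is also applied in the Hive metastore and cannot be reversed with UNDROP.
DROP TABLE hms_in_uc.sales.orders_staging;
```

The same statements fail against a federated catalog for an external Hive metastore, because those catalogs are read-only.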

How do you set up Hive metastore federation?

To set up Hive metastore federation, you do the following:

  1. Create a connection in Unity Catalog that specifies the path and credentials for accessing the Hive metastore.

    Hive metastore federation uses this connection to crawl the Hive metastore. For most database systems, you supply a username and password. For a connection to a legacy internal Azure Databricks workspace Hive metastore, Hive metastore federation takes care of authorization.

  2. Create a storage credential and an external location in Unity Catalog for the paths to the tables registered in the Hive metastore.

    External locations contain paths and the storage credentials required to access those paths. Storage credentials are Unity Catalog securable objects that specify credentials, such as Azure managed identities, for access to cloud storage. Depending on the workflow you choose for creating external locations, you might have to create storage credentials before you create the external location.

  3. Create a federated catalog in Unity Catalog, using the connection that you created in step 1.

    This is the catalog that workspace users and workflows use to work with Hive metastore tables using Unity Catalog. After you’ve created the federated catalog, Unity Catalog populates it with the tables registered in the Hive metastore.

  4. Grant privileges to the tables in the federated catalog using Unity Catalog.

    You can also use Unity Catalog row and column filters for fine-grained access control.

  5. Start querying data.

    Access to federated data using Unity Catalog is read-only for external Hive metastores and read-and-write for internal Hive metastores.

    For both internal and external Hive metastores, Unity Catalog continuously updates table metadata as it changes in the Hive metastore. For internal Hive metastores, new tables and table updates committed from the federated catalog are written back to the Hive metastore, maintaining full interoperability between the Unity Catalog and Hive metastore catalogs.
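Step 4 might look something like the following sketch, which grants read access on a federated table and then attaches a row filter. The catalog name hms_in_uc is the example name used later in this article. The group names (analysts, admins), the schema and table (sales.orders), the region column, and the main.governance location of the filter function are all hypothetical.

```sql
-- Hypothetical principal (`analysts`), schema (sales), and table (orders).
GRANT USE CATALOG ON CATALOG hms_in_uc TO `analysts`;
GRANT USE SCHEMA ON SCHEMA hms_in_uc.sales TO `analysts`;
GRANT SELECT ON TABLE hms_in_uc.sales.orders TO `analysts`;

-- Optional row filter, defined in a regular (non-federated) Unity Catalog schema.
-- Members of the hypothetical `admins` group see every row; others see only EMEA rows.
CREATE FUNCTION main.governance.emea_only(region STRING)
  RETURN is_account_group_member('admins') OR region = 'EMEA';

ALTER TABLE hms_in_uc.sales.orders
  SET ROW FILTER main.governance.emea_only ON (region);
```

Granting at the schema or catalog level instead of per table reduces the number of statements to maintain, because Unity Catalog privileges are inherited by child objects.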

For detailed instructions, see:

How do you use Hive metastore federation during migration to Unity Catalog?

Hive metastore federation lets you migrate to Unity Catalog incrementally by reducing the need for coordination between teams and workloads. In particular, if you are migrating from your Azure Databricks workspace’s internal Hive metastore, the ability to read from and write to both the Hive metastore and the Unity Catalog metastore means that you can maintain “mirrored” metastores during your migration, providing the following benefits:

  • Workloads that run against federated catalogs run in Hive metastore compatibility mode, reducing the cost of code adaptation during migration.
  • Each workload can choose to migrate independently of others, knowing that, during the migration period, data will be available in both Hive metastore and Unity Catalog, alleviating the need to coordinate between workloads that have dependencies on one another.

Diagram that gives overview of HMS federation in the context of migration

This section describes a typical workflow for migrating an Azure Databricks workspace’s internal legacy Hive metastore to Unity Catalog, with Hive metastore federation easing the transition. It does not apply to migrating an external Hive metastore, because federated catalogs for external Hive metastores do not support writes.

Step 1. Federate the internal Hive metastore

In this step, you create a federated catalog that mirrors your Hive metastore in Unity Catalog. Let’s call it hms_in_uc.
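In SQL, this step could be sketched roughly as follows. Treat the statements as an assumption-laden outline rather than the definitive syntax: the connection name hms_internal_conn is hypothetical, and the hive_metastore connection type and the builtin and authorized_paths options should be confirmed against the detailed setup instructions before use. The storage path reuses the example path from the authorized paths section of this article.

```sql
-- Connection to the workspace's legacy internal Hive metastore.
-- ASSUMPTION: connection type and option name; verify in the setup instructions.
CREATE CONNECTION IF NOT EXISTS hms_internal_conn TYPE hive_metastore
OPTIONS (builtin true);

-- Federated catalog that mirrors the Hive metastore.
-- ASSUMPTION: authorized_paths option name; the path is the article's example path.
CREATE FOREIGN CATALOG IF NOT EXISTS hms_in_uc
USING CONNECTION hms_internal_conn
OPTIONS (authorized_paths 'abfss://container@storageaccount.dfs.core.windows.net/bucket/');
```

The external locations that cover the authorized paths must exist before users can query the federated tables.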

Diagram that shows workloads running on the Hive metastore and the existence of the mirrored Unity Catalog federated catalog, hms_in_uc

Note

As part of the federation process, you set up external locations to provide access to the data in cloud storage. In migration scenarios in which some workloads are querying the data using legacy access mechanisms and other workloads are querying the same data in Unity Catalog, the Unity Catalog-managed access controls on external locations can prevent the legacy workloads from accessing the paths to storage from Unity Catalog-enabled compute. You can enable “fallback mode” on these external locations to fall back on any cluster- or notebook-scoped credentials that were defined for the legacy workload. Then when your migration is done, you turn fallback mode off. See What is fallback mode?.

For details, see Enable Hive metastore federation for a legacy workspace Hive metastore.

Step 2. Run new workloads against the federated catalog in Unity Catalog

When you have a federated catalog in place, you can grant SQL analysts and data science consumers access to it and start developing new workloads that point to it. The new workloads benefit from the additional feature set in Unity Catalog, including access controls, search, and lineage.

Diagram that shows existing workloads running on the Hive metastore and new workloads running on the mirrored Unity Catalog federated catalog, hms_in_uc

In this step, you typically do the following:

  • Choose Unity Catalog-compatible compute (that is, single-user or shared cluster access modes, SQL warehouses, or serverless compute). See Requirements, supported features, and limitations.
  • Make the federated catalog the default catalog on the compute resource or add USE CATALOG hms_in_uc to the top of your code. Because schemas and table names in the federated catalog are exact mirrors of those in the Hive metastore, your code will start referring to the federated catalog.
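For example, a notebook or query that previously referenced Hive metastore tables by schema and table name can be pointed at the federated catalog with a single statement. The schema, table, and column names below are hypothetical.

```sql
-- Switch the session's current catalog to the federated catalog.
USE CATALOG hms_in_uc;

-- Confirm which catalog two-level names now resolve against.
SELECT current_catalog();

-- Unchanged legacy-style reference; it now resolves to hms_in_uc.sales.orders.
SELECT order_id, amount
FROM sales.orders
LIMIT 10;
```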

Step 3. Migrate existing jobs to run against the federated catalog

To migrate existing jobs to query the federated catalog:

  1. Change the default catalog on the job cluster to be hms_in_uc, either by setting a property on the cluster itself or by adding USE CATALOG hms_in_uc at the top of your code.
  2. Switch the job to single-user or shared access mode compute and upgrade to one of the Databricks Runtime versions that supports Hive metastore federation. See Requirements, supported features, and limitations.
  3. Ask an Azure Databricks admin to grant the correct Unity Catalog privileges on the data objects in hms_in_uc and on any cloud storage paths (included in Unity Catalog external locations) that the job accesses. See Manage privileges in Unity Catalog.

Second instance of the diagram that gives an overview of HMS federation in the context of migration

Step 4. Deny access to the Hive metastore

Once you’ve migrated all of your workloads to query the federated catalog, you no longer need the Hive metastore. You can use legacy table access controls and compute permissions to block direct access from your Azure Databricks workspace to the Hive metastore. For example, you can:

  1. Revoke all privileges on the objects in the Hive metastore catalog.

    The MSCK REPAIR PRIVILEGES command is convenient for this purpose. See MSCK REPAIR PRIVILEGES and Hive metastore privileges and securable objects (legacy).

  2. Prevent users from creating and using clusters that bypass table access control (clusters that use no isolation shared access mode or a legacy custom cluster type) using compute policies.

    See Manage compute configurations.

  3. Make the federated catalog the workspace default catalog.

    See Manage the default catalog.
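A minimal sketch of the first item, assuming legacy table access control is enabled and that the statements run on compute that enforces it. The schema (default) and the group (users) are hypothetical; repeat the statements for each schema and principal that holds legacy privileges.

```sql
-- Review the legacy privileges that are currently granted on a Hive metastore schema.
SHOW GRANTS ON SCHEMA hive_metastore.default;

-- Remove direct access for a hypothetical group.
REVOKE ALL PRIVILEGES ON SCHEMA hive_metastore.default FROM `users`;
```

Users keep access to the same tables through the federated catalog, where Unity Catalog privileges apply.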

Frequently asked questions

The following sections provide more detailed information about Hive metastore federation.

What is fallback mode?

Fallback mode is a setting on external locations that you can use to bypass Unity Catalog permission checks during migration to Unity Catalog. Setting it ensures that workloads that haven’t yet been migrated are not impacted during the setup phase.

Unity Catalog gains access to cloud storage using external locations, which are securable objects that define a path and a credential to access your cloud storage account. You can issue permissions on them, like READ FILES, to govern who can use the path. One challenge during the migration process is that you might not want Unity Catalog to start governing all access to the path immediately, for example, when you have existing, unmigrated workloads that reference the path.

Fallback mode allows you to delay the strict enforcement of Unity Catalog access control on external locations. When fallback mode is enabled, workloads that access a path are first checked against Unity Catalog permissions. If that check fails, they fall back to using cluster- or notebook-scoped credentials, such as instance profiles or Apache Spark configuration properties. This allows existing workloads to continue using their current credentials.

Fallback mode is intended only for use during migration. You should turn it off when all workloads have been migrated and you are ready to enforce Unity Catalog access controls.

Query audit log for fallback usage

Use the following query to check if any access to the external location used fallback mode in the last 30 days. If there is no fallback-mode access in your account, Databricks recommends turning off fallback mode.

SELECT event_time, user_identity, action_name, request_params, response, identity_metadata
FROM system.access.audit
WHERE
  request_params.fallback_enabled = 'true' AND
  -- Replace some-path with part of the external location path that you want to check.
  request_params.path LIKE '%some-path%' AND
  event_time >= current_date() - INTERVAL 30 DAYS
LIMIT 10

What are authorized paths?

When you create a federated catalog, you are prompted to provide authorized paths to the cloud storage where the Hive metastore tables are stored. Any table that you want to access using Hive metastore federation must be covered by these paths. Databricks recommends that your authorized paths be sub-paths that are common across a large number of tables. For example, if you have tables at abfss://container@storageaccount.dfs.core.windows.net/bucket/table1, ./bucket/table2, and ./bucket/table3, you should provide abfss://container@storageaccount.dfs.core.windows.net/bucket/ as an authorized path.
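To get a rough sense of whether a candidate authorized path covers your tables, you can compare it against a list of table locations. The sketch below uses a simple prefix match, which is only an approximation of how coverage is evaluated. The first two locations come from the example above; the third (other/table9) is a hypothetical location outside the authorized path.

```sql
-- Prefix check of table locations against one candidate authorized path.
SELECT
  location,
  location LIKE 'abfss://container@storageaccount.dfs.core.windows.net/bucket/%' AS covered
FROM VALUES
  ('abfss://container@storageaccount.dfs.core.windows.net/bucket/table1'),
  ('abfss://container@storageaccount.dfs.core.windows.net/bucket/table2'),
  ('abfss://container@storageaccount.dfs.core.windows.net/other/table9')  -- hypothetical
  AS t(location);
```

Locations that come back as not covered need either a broader authorized path or an additional one.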

You can use UCX to help you identify the paths that are present in your Hive metastore.

Authorized paths add an extra layer of security on federated catalogs by enabling the catalog owner to apply guardrails to the data that users can access using federation. This is useful if your Hive metastore allows users to update metadata and arbitrarily alter table locations—updates that would otherwise be synchronized into the federated catalog. In this scenario, users could potentially redefine tables that they already have access to so that they point to new locations that they would otherwise not have access to.

Can I federate Hive metastores using UCX?

UCX, the Databricks Labs project for migrating Azure Databricks workspaces to Unity Catalog, includes utilities for enabling Hive metastore federation:

  • enable-hms-federation
  • create-federated-catalog

See the project readme in GitHub. For an introduction to UCX, see Use the UCX utilities to upgrade your workspace to Unity Catalog.

Requirements, supported features, and limitations

The following table lists the services and features that are supported by Hive metastore federation. In some cases, unsupported services or features are also listed. In this table, “HMS” stands for Hive metastore.

| Category | Supported | Not supported |
| --- | --- | --- |
| Metastores | Legacy workspace Hive metastores (internal to Databricks); External metastores on Apache Hive version 0.13 or 2.3 using MySQL | External metastores in databases other than MySQL; Hive 3.1 |
| Operations | Internal Databricks HMS: reads and writes; External HMS: reads only | |
| Hive metastore data assets | Managed and external tables in Hive metastore; Schemas; Views; Hive SerDe tables | Hive functions and UDFs; Defining new shallow clones in the federated catalog; JDBC-backed tables; Delta Sharing shared tables; Accessing shallow clones registered in the Hive metastore through the federated catalog |
| Storage | Azure Data Lake Storage Gen2; Tables that reference DBFS mount locations, including DBFS root; Tables whose paths overlap with other HMS table paths defined in external locations | HMS tables whose paths overlap with native Unity Catalog object paths; Access to tables in DBFS root or mount locations registered in an external HMS; Access to tables in DBFS root or mount locations from any workspace other than the one in which the internal HMS is defined; Firewall support for the workspace storage account |
| Compute types | Shared clusters; Single user (assigned) clusters; Serverless (all); SQL warehouses (all) | No isolation clusters |
| Compute versions | All Databricks SQL channels; All Delta Live Tables channels; Databricks Runtime 13.3 LTS; Databricks Runtime 14.3 LTS; Databricks Runtime 15.1 and above | |
| Unity Catalog features | Unity Catalog privilege model; Row filters and column masks; Auditing; Downstream lineage; Table search; Cross-workspace access (except DBFS root and mounts); Data access limited to defined external locations | Delta Sharing; Lakehouse monitoring; Vector search; Online tables; Some feature store functionality, including feature store creation, model serving creation, feature spec creation, model logging and batch scoring; Writing Delta Live Tables materialized views and streaming tables into a federated catalog (federated assets can still be used as a source for them); Auto-migration of legacy table ACLs to Unity Catalog privileges for the federated catalog (UCX can help with this) |