Database objects in the legacy Hive metastore

The Azure Databricks documentation focuses on working with data objects using Unity Catalog, but most of the instructions also apply to working with objects registered in the legacy Hive metastore.

This article focuses on how to work with database objects registered in the legacy Hive metastore. Specifically, this article calls out where working with Hive metastore objects differs from working with Unity Catalog objects. It also describes other behaviors that might be unexpected.

Databricks recommends that you migrate all data from the legacy Hive metastore to Unity Catalog. See Upgrade Hive tables and views to Unity Catalog.

How does Hive metastore data governance work?

Although Azure Databricks workspaces continue to include the built-in Hive metastore, data governance using Hive metastore is deprecated. Databricks recommends that you use Unity Catalog for all data governance. See Work with Unity Catalog and the legacy Hive metastore.

Enabling a workspace for Unity Catalog does not reduce your ability to work with data already registered in Hive metastore. All data objects registered in the legacy Hive metastore are surfaced in Unity Catalog interfaces in the hive_metastore catalog. A hybrid Hive metastore and Unity Catalog workspace can be a useful model for transitioning a longstanding Hive metastore workspace. However, the data governance and performance advantages of Unity Catalog are significant, and you should fully transition your workspaces as soon as you can.

Hive metastore uses table access control (table ACLs) to manage access to database objects. Some support remains for table access control when you use compute in shared access mode. See Hive metastore table access control (legacy).
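
For example, with table access control enabled on supported compute, an admin can grant privileges on a Hive metastore table using SQL. The schema, table, and group names in this sketch are hypothetical:

-- Grant read access on a legacy Hive metastore table to a group (all names hypothetical).
GRANT SELECT ON TABLE hive_metastore.sales.orders TO `data-readers`;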

Credential passthrough is a deprecated pattern for data governance on Hive metastore database objects. This article does not address credential passthrough. See Credential passthrough (legacy).

Note

Where this article refers to data access control in Hive metastore, it is referring to legacy table access control.

What is the hive_metastore catalog?

In a workspace that is enabled for Unity Catalog, all schemas in the Hive metastore appear as children of the hive_metastore catalog in the Unity Catalog three-level namespace. Hive metastore does not actually use catalogs, and this construct provides an entry point to tables in the legacy Hive metastore for Unity Catalog users. Use the following syntax to query tables in the legacy Hive metastore:

SELECT * FROM hive_metastore.schema_name.table_name

Note

You can optionally set the hive_metastore catalog as the workspace default in Unity Catalog-enabled workspaces. See Manage the default catalog.
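
If you want this behavior only for your current session rather than for the whole workspace, a session-level alternative is to switch the default catalog with USE CATALOG:

-- Make hive_metastore the default catalog for this session only.
USE CATALOG hive_metastore;

-- Two-level names now resolve against the legacy Hive metastore.
SELECT * FROM schema_name.table_name;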

Schemas in Hive metastore

In the legacy Hive metastore, a schema is the highest level in the data object hierarchy.

There are some important differences between Unity Catalog and Hive metastore, including the following:

  • You cannot create schemas in the Hive metastore using Catalog Explorer, but you can view and edit permissions for existing schemas.
  • Schemas created in the Hive metastore can use only alphanumeric ASCII characters and underscores in their names.
  • Hive metastore allows you to declare a LOCATION for a schema during creation, as shown in the example after this list. This functions similarly to Unity Catalog managed storage locations, with the following behavioral differences:
    • If you don’t provide a location, the default location /user/hive/warehouse/<schema-name> is used. This location is on the DBFS root, which is not recommended for storing any production data.
    • The provided path can be any cloud storage location available to the user that creates the schema, including cloud URIs, DBFS root, and DBFS mounts.
    • Access to the location is not managed by Hive metastore.
    • Deleting a schema in the Hive metastore causes all files in that schema location to be deleted recursively, regardless of the table type (managed or external).
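
A minimal sketch of both patterns follows. The schema names and the storage container and account names are hypothetical:

-- Without LOCATION: uses the default DBFS root path /user/hive/warehouse/my_schema,
-- which is not recommended for production data.
CREATE SCHEMA hive_metastore.my_schema;

-- With LOCATION: uses an explicit cloud storage path.
-- Access to this path is not managed by Hive metastore.
CREATE SCHEMA hive_metastore.my_schema_ext
LOCATION 'abfss://container-name@storage-account-name.dfs.core.windows.net/schemas/my_schema_ext';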

To avoid accidental data loss, Databricks recommends the following when you work with Hive metastore schema locations:

  • Don’t assign a schema location that already contains data.
  • Don’t create an external table in a schema location.
  • Don’t share a location among multiple schemas.
  • Don’t assign a schema location that overlaps another schema location. In other words, don’t use a path that is a child of another schema location.
  • Don’t assign a schema location that overlaps the location of an external table.

Managed tables in Hive metastore

Managed tables in Hive metastore do not have any of the performance benefits of managed tables in Unity Catalog. Like Unity Catalog managed tables, Hive metastore managed tables use Delta Lake by default. However, in Hive metastore, unlike Unity Catalog, you can also create a managed table using most of the other data formats supported by Azure Databricks.
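
For example, both of the following statements create managed tables because neither specifies a LOCATION. The schema and table names are hypothetical:

-- Delta Lake is the default format for managed tables.
CREATE TABLE hive_metastore.my_schema.managed_delta (id INT, name STRING);

-- Unlike Unity Catalog, Hive metastore also accepts other formats for managed tables.
CREATE TABLE hive_metastore.my_schema.managed_parquet (id INT, name STRING) USING PARQUET;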

Managed tables in Hive metastore are always created in the storage location of the containing schema. The compute that you use to query a managed table must have access to the storage location.

Hive metastore does not manage the data layout of managed tables the way that Unity Catalog does. When you drop a managed table in Hive metastore, all underlying data files are deleted immediately. In Unity Catalog, on the other hand, you can UNDROP a managed table for 7 days, and data is permanently deleted within 30 days.
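
The following sketch contrasts the two behaviors. The table and catalog names are hypothetical, and UNDROP applies only to Unity Catalog managed tables:

-- Hive metastore: underlying data files are deleted immediately and cannot be recovered.
DROP TABLE hive_metastore.my_schema.my_table;

-- Unity Catalog: a dropped managed table can be recovered for 7 days.
UNDROP TABLE my_catalog.my_schema.my_table;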

You can use path-based access to read or write data in Hive metastore managed tables, while in Unity Catalog you cannot and do not need to.
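
For example, the following path-based read works against a Hive metastore managed table but would fail for a Unity Catalog managed table. The path is hypothetical:

-- Read the Delta files backing a Hive metastore managed table directly by path.
SELECT * FROM delta.`/user/hive/warehouse/my_schema/my_table`;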

External tables in Hive metastore

Most tables created in Azure Databricks before the introduction of Unity Catalog were configured as external tables in the Hive metastore. Legacy recommendations that favored external tables usually focused on a few key aspects:

  • You could register an external table on top of existing data in cloud object storage (see the sketch after this list).
  • You could directly access data files in external tables from external systems for reads or writes.
  • Data files were not deleted if the table was dropped accidentally.
  • Because external tables require a LOCATION, production data was less likely to accidentally end up in the DBFS root.
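
A minimal sketch of registering an external table over existing Delta files. The container, account, schema, and table names are hypothetical:

-- Registers an external table over existing data.
-- Dropping this table later leaves the underlying files in place.
CREATE TABLE hive_metastore.my_schema.my_external_table
USING DELTA
LOCATION 'abfss://container-name@storage-account-name.dfs.core.windows.net/existing/data';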

Azure Databricks now recommends Unity Catalog managed tables for most tabular data storage. See Work with managed tables.

Views in Hive metastore

You can declare a view in Hive metastore backed by any data source supported by Azure Databricks. In Unity Catalog, you can only declare views against Unity Catalog tables and views, including foreign tables, materialized views, and Delta Sharing tables.

Because of the ability to declare views against non-tabular data sources, views in Hive metastore can grant unexpected or unintended access to data in combination with other access configurations in the user environment.

For example, consider the following setup, sketched in code after this list:

  • A table my_table is defined using the DBFS mount path /mnt/my_table.
    • DBFS mount credentials are stored in the workspace, so all users have access to this path by default.
  • Table ACLs are used to restrict access to my_table to a group of users.
    • Legacy table ACLs apply only on compute configured with shared access mode or on SQL warehouses.
  • A view my_view is defined directly against the cloud URI backing the same data files 'abfss://container-name@storage-account-name.dfs.core.windows.net/my_table'.
    • URI credentials rely on access policies defined in the Spark session or compute configuration.
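
A minimal sketch of these definitions, using the hypothetical mount path and URI from the list above:

-- Table registered over a DBFS mount; legacy table ACLs restrict access to it.
CREATE TABLE my_table USING DELTA LOCATION '/mnt/my_table';

-- View defined directly against the cloud URI backing the same data files.
CREATE VIEW my_view AS
SELECT * FROM delta.`abfss://container-name@storage-account-name.dfs.core.windows.net/my_table`;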

The view my_view has the following properties:

  • It does not use the DBFS mount credentials used to mount cloud object storage to /mnt/my_table.
  • It does not respect the table ACLs set on my_table, regardless of compute configurations.
  • It requires a data access policy configured for compute that provides read access to 'abfss://container-name@storage-account-name.dfs.core.windows.net/my_table'.

Note

This is one example of unexpected behavior you might encounter, and not a comprehensive list of all the potential pitfalls presented by views in the legacy Hive metastore. Databricks recommends using Unity Catalog for all view definitions.

Legacy Hive tables and HiveQL support

Azure Databricks includes some legacy support for Hive tables and HiveQL functionality. This functionality is a remnant of early versions of Azure Databricks and the Apache Hadoop ecosystem of tools. Databricks does not recommend using Hive tables or other Hive functionality, as this functionality is not optimized and lacks support in some compute configurations.

The following articles describe legacy Hive functionality: