Use Azure managed identities in Unity Catalog to access storage

This article describes how to use Azure managed identities for connecting to storage containers on behalf of Unity Catalog users.

What are Azure managed identities?

Unity Catalog can be configured to use an Azure managed identity to access storage containers on behalf of Unity Catalog users. Managed identities provide an identity for applications to use when they connect to resources that support Microsoft Entra ID authentication.

You can use managed identities in Unity Catalog to support two primary use cases:

  • As an identity to connect to the metastore’s managed storage accounts (where managed tables are stored).
  • As an identity to connect to other external storage accounts (either for file-based access or for accessing existing datasets through external tables).

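Configuring Unity Catalog with a managed identity has the following benefits over configuring Unity Catalog with a service principal:

  • Managed identities do not require you to manage or rotate secrets.
  • A managed identity allows Unity Catalog to access storage accounts that are protected by network rules, which is not possible using service principals.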

Configure a managed identity for Unity Catalog

To configure a managed identity to use with Unity Catalog, you first create an access connector for Azure Databricks in Azure. By default, the access connector will deploy with a system-assigned managed identity. You can choose instead to attach a user-assigned managed identity. You then grant the managed identity access to your Azure Data Lake Storage Gen2 account and use the access connector when you create a Unity Catalog metastore or storage credential.

Requirements

The Azure user or service principal who creates the access connector must:

  • Be a Contributor or Owner of an Azure resource group.

The Azure user or service principal who grants the managed identity access to the storage account must:

  • Be an Owner or a user with the User Access Administrator Azure RBAC role on the storage account.

Step 1: Create an access connector for Azure Databricks

The Access Connector for Azure Databricks is a first-party Azure resource that lets you connect managed identities to an Azure Databricks account.

Each access connector for Azure Databricks can contain either one system-assigned managed identity or one user-assigned managed identity. If you want to use multiple managed identities, create a separate access connector for each.

Use a system-assigned managed identity

  1. Log in to the Azure Portal as a Contributor or Owner of a resource group.

  2. Click + Create or Create a new resource.

  3. Search for Access Connector for Azure Databricks and select it.

  4. Click Create.

  5. On the Basics tab, accept, select, or enter values for the following fields:

    • Subscription: This is the Azure subscription that the access connector will be created in. The default is the Azure subscription you are currently using. It can be any subscription in the tenant.
    • Resource group: This is the Azure resource group that the access connector will be created in.
    • Name: Enter a name that indicates the purpose of the connector.
    • Region: This should be the same region as the storage account that you will connect to.
  6. Click Review + create.

  7. When you see the Validation Passed message, click Create.

    When the deployment succeeds, the access connector is deployed with a system-assigned managed identity.

  8. When the deployment is complete, click Go to resource.

  9. Make a note of the Resource ID.

    The resource ID is in the format:

    /subscriptions/12f34567-8ace-9c10-111c-aea8eba12345c/resourceGroups/<resource-group>/providers/Microsoft.Databricks/accessConnectors/<connector-name>
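
The portal steps above can also be scripted. The following is a minimal sketch, assuming the Azure CLI is installed with the `databricks` extension and that you are already signed in. The function name `create_access_connector` and the values in the usage comment are hypothetical.

```shell
# Create an access connector with a system-assigned managed identity and
# print its resource ID. Assumes the Azure CLI "databricks" extension.
create_access_connector() {
  resource_group="$1"   # existing resource group
  connector_name="$2"   # name that indicates the connector's purpose
  region="$3"           # same region as the storage account

  az databricks access-connector create \
    --resource-group "$resource_group" \
    --name "$connector_name" \
    --location "$region" \
    --identity-type SystemAssigned \
    --query id \
    --output tsv
}

# Usage (hypothetical values):
# create_access_connector my-rg my-connector eastus
```

The `--query id --output tsv` options print only the resource ID, which is the value to note in step 9.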
    

Use a user-assigned managed identity

  1. If you do not already have a user-assigned managed identity, create a new one and note its resource ID.

    See Manage user-assigned managed identities.

  2. Log in to the Azure Portal as a Contributor or Owner of a resource group.

    The resource group should be in the same region as the storage account that you want to connect to.

  3. Search for Deploy a custom template and select it.

  4. Select Build your own template and paste the following template into the editor:

    {
     "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
     "contentVersion": "1.0.0.0",
     "parameters": {
         "connectorName": {
             "defaultValue": "testConnector",
             "type": "String",
             "metadata": {
                 "description": "The name of the Azure Databricks Access Connector to create."
             }
         },
         "accessConnectorRegion": {
             "defaultValue": "[resourceGroup().location]",
             "type": "String",
             "metadata": {
                 "description": "Location for the access connector resource."
             }
         },
         "userAssignedManagedIdentiy": {
             "type": "String",
             "metadata": {
                 "description": "The resource Id of the user assigned managed identity."
             }
         }
     },
     "resources": [
         {
             "type": "Microsoft.Databricks/accessConnectors",
             "apiVersion": "2023-05-01",
             "name": "[parameters('connectorName')]",
             "location": "[parameters('accessConnectorRegion')]",
             "identity": {
                 "type": "UserAssigned",
                 "userAssignedIdentities": {
                     "[parameters('userAssignedManagedIdentiy')]": {}
                 }
             }
         }
     ]
    }
    
  5. On the Basics tab, accept, select, or enter values for the following fields:

    • Subscription: The Azure subscription that the access connector will be created in. The default is the Azure subscription you are currently using. It can be any subscription in the tenant.
    • Resource group: A resource group in the same region as the storage account that you will connect to.
    • Name: A name that indicates the purpose of the connector.
    • Region: This should be the same region as the storage account that you will connect to. You can choose the pre-populated value ‘[resourceGroup().location]’ if the resource group was created in the same region as the storage account that you will connect to.
    • User Assigned Managed Identity: The Resource ID of the user-assigned managed identity that you want to use.
  6. Click Review + create.

  7. When you see the Validation Passed message, click Create.

  8. When the deployment is complete, click Go to resource.

  9. Make a note of the Resource ID.

    The resource ID is in the format:

    /subscriptions/12f34567-8ace-9c10-111c-aea8eba12345c/resourceGroups/<resource-group>/providers/Microsoft.Databricks/accessConnectors/<connector-name>
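
As an alternative to pasting the template into the portal, the same template can be deployed from the command line. This is a sketch under stated assumptions: the template is saved locally as `access-connector.json` (a hypothetical file name), the Azure CLI is signed in, and the function name `deploy_access_connector` is illustrative only.

```shell
# Deploy the ARM template shown above into a resource group that is in the
# same region as the storage account.
deploy_access_connector() {
  resource_group="$1"
  connector_name="$2"
  identity_id="$3"      # resource ID of the user-assigned managed identity

  # The last parameter name is spelled exactly as it appears in the template.
  az deployment group create \
    --resource-group "$resource_group" \
    --template-file access-connector.json \
    --parameters \
      connectorName="$connector_name" \
      userAssignedManagedIdentiy="$identity_id"
}

# Usage (hypothetical values):
# deploy_access_connector my-rg my-connector "<user-assigned-identity-resource-id>"
```

Because `accessConnectorRegion` defaults to the resource group's location, it is left out here.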
    

Step 2: Grant the managed identity access to the storage account

To grant the permissions in this step, you must have the Owner or User Access Administrator Azure RBAC role on your storage account.

  1. In the Azure portal, go to your Azure Data Lake Storage Gen2 account.
  2. Go to Access Control (IAM), click + Add, and select Add role assignment.
  3. Select the Storage Blob Data Contributor role and click Next.
  4. Under Assign access to, select Managed identity.
  5. Click +Select Members, and select either Access connector for Azure Databricks or User-assigned managed identity.
  6. Search for your connector name or user-assigned identity, select it, and click Review and Assign.

Alternatively, you can limit access to the storage account by granting the managed identity access to a specific container. Follow the same steps above, but grant the Storage Blob Delegator role on the storage account and the Storage Blob Data Contributor role on the container.
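
The role assignments in this step can also be made with the Azure CLI. The sketch below assumes you already know the object (principal) ID of the managed identity. The helper names `grant_role`, `storage_account_scope`, and `container_scope` are hypothetical.

```shell
# Assign an Azure RBAC role to the managed identity at a given scope.
grant_role() {
  principal_id="$1"     # object (principal) ID of the managed identity
  role="$2"
  scope="$3"

  az role assignment create \
    --assignee-object-id "$principal_id" \
    --assignee-principal-type ServicePrincipal \
    --role "$role" \
    --scope "$scope"
}

# Build the resource ID of a storage account, used as the role scope.
storage_account_scope() {
  printf '/subscriptions/%s/resourceGroups/%s/providers/Microsoft.Storage/storageAccounts/%s\n' "$1" "$2" "$3"
}

# Build the resource ID of one container inside a storage account.
container_scope() {
  printf '%s/blobServices/default/containers/%s\n' "$1" "$2"
}

# Usage (placeholder values):
# account="$(storage_account_scope <subscription-id> <resource-group> <storage-account>)"
# Account-wide access:
#   grant_role <principal-id> "Storage Blob Data Contributor" "$account"
# Container-only access:
#   grant_role <principal-id> "Storage Blob Delegator" "$account"
#   grant_role <principal-id> "Storage Blob Data Contributor" "$(container_scope "$account" <container>)"
```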

Step 3: Grant the managed identity access to file events

Granting your managed identity access to file events allows Azure Databricks to subscribe to file event notifications emitted by cloud providers. This makes file processing more efficient. To grant the permissions in this step, you must have the Owner or User Access Administrator Azure RBAC role on your storage account.

  1. In the Azure portal, go to your Azure Data Lake Storage Gen2 account.
  2. Go to Access Control (IAM), click + Add, and select Add role assignment.
  3. Select the Storage Queue Data Contributor role, and click Next.
  4. Under Assign access to, select Managed identity.
  5. Click +Select Members, and select either Access connector for Azure Databricks or User-assigned managed identity.
  6. Search for your connector name or user-assigned identity, select it, and click Review and Assign.

Step 4: Grant Azure Databricks access to configure file events on your behalf

Note

This step is optional but highly recommended. If you do not grant Azure Databricks access to configure file events on your behalf, you must configure file events manually for each location. If you do neither, you will have limited access to critical features that Databricks may release in the future.

This step allows Azure Databricks to set up file events automatically. To grant the permissions in this step, you must have the Owner or User Access Administrator Azure RBAC role on your managed identity and on the resource group that your Azure Data Lake Storage Gen2 account is in.

  1. Follow the instructions in Step 3: Grant the managed identity access to file events and assign the Storage Account Contributor role, alongside the Storage Queue Data Contributor role, to your managed identity.
  2. Navigate to the Azure resource group that your Azure Data Lake Storage Gen2 account is in.
  3. Go to Access Control (IAM), click + Add, and select Add role assignment.
  4. Select the EventGrid EventSubscription Contributor role and click Next.
  5. Under Assign access to, select Managed identity.
  6. Click +Select Members, and select either Access connector for Azure Databricks or User-assigned managed identity.
  7. Search for your connector name or user-assigned identity, select it, and click Review and Assign.
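
The file-event role assignments from Steps 3 and 4 can be scripted together. This is a sketch, assuming the Azure CLI is signed in and that you know the object (principal) ID of the managed identity. The function name `grant_file_event_roles` is hypothetical.

```shell
# Assign the file-event roles from Steps 3 and 4 to the managed identity.
grant_file_event_roles() {
  principal_id="$1"     # object (principal) ID of the managed identity
  subscription_id="$2"
  resource_group="$3"   # resource group of the storage account
  storage_account="$4"

  group_scope="/subscriptions/$subscription_id/resourceGroups/$resource_group"
  account_scope="$group_scope/providers/Microsoft.Storage/storageAccounts/$storage_account"

  # "role|scope" pairs: two roles on the storage account, one on the resource group.
  for pair in \
    "Storage Queue Data Contributor|$account_scope" \
    "Storage Account Contributor|$account_scope" \
    "EventGrid EventSubscription Contributor|$group_scope"
  do
    az role assignment create \
      --assignee-object-id "$principal_id" \
      --assignee-principal-type ServicePrincipal \
      --role "${pair%%|*}" \
      --scope "${pair#*|}"
  done
}

# Usage (placeholder values):
# grant_file_event_roles <principal-id> <subscription-id> <resource-group> <storage-account>
```

To skip the optional Step 4, remove the last two pairs and keep only Storage Queue Data Contributor.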

Use a managed identity to access the Unity Catalog root storage account

This section describes how to give the managed identity access to the root storage account when you create a Unity Catalog metastore.

To learn how to upgrade an existing Unity Catalog metastore to use a managed identity, see Upgrade your existing Unity Catalog metastore to use a managed identity to access its root storage.

  1. As an Azure Databricks account admin, log in to the Azure Databricks account console.
  2. Click Catalog.
  3. Click Create Metastore.
  4. Enter values for the following fields:
    • Name for the metastore.

    • Region where the metastore will be deployed.

      For best performance, co-locate the access connector, workspaces, metastore and cloud storage location in the same cloud region.

    • ADLS Gen 2 path: enter the path to the storage container that you will use as root storage for the metastore.

      The abfss:// prefix is added automatically.

    • Access Connector ID: enter the Azure Databricks access connector’s resource ID in the format:

      /subscriptions/12f34567-8ace-9c10-111c-aea8eba12345c/resourceGroups/<resource-group>/providers/Microsoft.Databricks/accessConnectors/<connector-name>
      
    • (Optional) Managed Identity ID: If you created the access connector using a user-assigned managed identity, enter the resource ID of the managed identity.

  5. Click Create.
  6. When prompted, select workspaces to link to the metastore.

Use a managed identity to access external storage managed in Unity Catalog

Unity Catalog gives you the ability to access existing data in storage accounts using storage credentials and external locations. Storage credentials store the managed identity, and external locations define a path to storage along with a reference to the storage credential. You can use this approach to grant and control access to existing data in cloud storage and to register external tables in Unity Catalog.

A storage credential can hold a managed identity or service principal. Using a managed identity has the benefit of allowing Unity Catalog to access storage accounts protected by network rules, which isn’t possible using service principals, and it removes the need to manage and rotate secrets.

To create a storage credential using a managed identity and assign that storage credential to an external location, follow the instructions in Connect to cloud object storage and services using Unity Catalog.
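
For orientation, the relationship between a storage credential and an external location can be sketched with the Databricks CLI. This is an illustration under assumptions, not the procedure from the linked article: the function name `create_external_location` is hypothetical, the sketch covers a system-assigned managed identity only, and the positional form of `external-locations create` (name, URL, credential name) should be checked against your CLI version.

```shell
# Create a storage credential backed by the access connector, then an
# external location that points at a storage path and references it.
create_external_location() {
  credential_name="$1"
  access_connector_id="$2"   # resource ID of the access connector
  location_name="$3"
  location_url="$4"          # for example abfss://<container>@<account>.dfs.core.windows.net/<path>
  profile="$5"               # authentication configuration profile

  databricks storage-credentials create --json "{
    \"name\": \"$credential_name\",
    \"azure_managed_identity\": {
      \"access_connector_id\": \"$access_connector_id\"
    }
  }" --profile "$profile"

  databricks external-locations create \
    "$location_name" "$location_url" "$credential_name" \
    --profile "$profile"
}
```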

If your Azure Databricks workspace is deployed in your own Azure virtual network, also known as “VNet injection”, and you use a storage firewall to protect an Azure Data Lake Storage Gen2 account, you must:

  1. Enable your Azure Databricks workspace to access Azure Storage.
  2. Enable your managed identity to access Azure Storage.

Step 1: Enable your Azure Databricks workspace to access Azure Storage

You must configure network settings to allow your Azure Databricks workspace to access Azure Data Lake Storage Gen2. You can configure either private endpoints or access from your virtual network on Azure Data Lake Storage Gen2 to allow connections from your subnets to your Azure Data Lake Storage Gen2 account.

For instructions, see Grant your Azure Databricks workspace access to Azure Data Lake Storage Gen2.
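
As a rough illustration of the virtual network option (not the private endpoint option), the following sketch adds a virtual network rule for one workspace subnet. It assumes the Azure CLI is signed in and that you have the full resource ID of the subnet. The function name `allow_subnet` is hypothetical, and the linked instructions remain the reference.

```shell
# Allow a workspace subnet to reach the storage account through a
# virtual network rule. Run once per workspace subnet.
allow_subnet() {
  storage_group="$1"    # resource group of the storage account
  storage_account="$2"
  subnet_id="$3"        # full resource ID of the workspace subnet

  # The subnet needs the Microsoft.Storage service endpoint before the rule works.
  az network vnet subnet update \
    --ids "$subnet_id" \
    --service-endpoints Microsoft.Storage

  az storage account network-rule add \
    --resource-group "$storage_group" \
    --account-name "$storage_account" \
    --subnet "$subnet_id"
}
```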

Step 2: Enable your managed identity to access Azure Storage

This step is necessary only if “Allow Azure services on the trusted services list to access this storage account” is disabled for your Azure Storage account. If that configuration is enabled:

  • Any access connector for Azure Databricks in the same tenant as the storage account can access the storage account.
  • Any Azure trusted service can access the storage account. See Grant access to trusted Azure services.

The instructions below include a step in which you disable this configuration. You can use the Azure Portal or the Azure CLI.

Use the Azure Portal

  1. Log in to the Azure Portal, find and select the Azure Storage account, and go to the Networking tab.

  2. Set Public Network Access to Enabled from selected virtual networks and IP addresses.

    As an option, you can instead set Public Network Access to Disabled. The managed identity can be used to bypass the check on public network access.

  3. Under Resource instances, select a Resource type of Microsoft.Databricks/accessConnectors and select your Azure Databricks access connector.

  4. Under Exceptions, clear the Allow Azure services on the trusted services list to access this storage account checkbox.

Use the Azure CLI

  1. Install the Azure CLI and sign in.

    To sign in by using a Microsoft Entra ID service principal, see Azure CLI login with a Microsoft Entra ID service principal.

    To sign in by using an Azure Databricks user account, see Azure CLI login with an Azure Databricks user account.

  2. Add a network rule to the storage account:

    az storage account network-rule add \
    --subscription <subscription ID of the resource group> \
    --resource-id <resource ID of the access connector for Azure Databricks> \
    --tenant-id <tenant ID> \
    -g <name of the Azure Storage resource group> \
    --account-name <name of the Azure Storage resource>
    

    Add the resource ID in the format:

    /subscriptions/12f34567-8ace-9c10-111c-aea8eba12345c/resourceGroups/<resource-group>/providers/Microsoft.Databricks/accessConnectors/<connector-name>
    
  3. After you create the network rule, go to your Azure Storage account in the Azure Portal and view the managed identity in the Networking tab under Resource instances, resource type Microsoft.Databricks/accessConnectors.

  4. Under Exceptions, clear the Allow Azure services on the trusted services list to access this storage account checkbox.

  5. Optionally, set Public Network Access to Disabled. The managed identity can be used to bypass the check on public network access.

    The standard approach is to keep this value set to Enabled from selected virtual networks and IP addresses.
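
The checks in steps 3 and 4 can also be done from the command line. This is a sketch with hypothetical function names. It assumes that passing `--bypass None` to `az storage account update` corresponds to clearing the trusted-services checkbox; that option also clears the Logging and Metrics exceptions if they were set.

```shell
# List the resource instance rules on the storage account. The access
# connector's resource ID should appear in the output.
list_resource_rules() {
  az storage account network-rule list \
    --resource-group "$1" \
    --account-name "$2" \
    --query "resourceAccessRules[].resourceId" \
    --output tsv
}

# Remove the trusted-services exception from the storage account firewall.
# Note: "--bypass None" also removes the Logging and Metrics exceptions.
disable_trusted_services_bypass() {
  az storage account update \
    --resource-group "$1" \
    --name "$2" \
    --bypass None
}

# Usage (placeholder values):
# list_resource_rules <storage-resource-group> <storage-account>
# disable_trusted_services_bypass <storage-resource-group> <storage-account>
```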

Serverless SQL warehouses are compute resources that run in the Azure subscription for Azure Databricks, not your Azure subscription. If you configure a firewall on Azure Data Lake Storage Gen2 and you plan to use serverless SQL warehouses, you must configure the firewall to allow access from serverless SQL warehouses.

For instructions, see Configure a firewall for serverless compute access.

Upgrade your existing Unity Catalog metastore to use a managed identity to access its root storage

If you have a Unity Catalog metastore that was created using a service principal and you would like to upgrade it to use a managed identity, you can update it using an API call.

  1. Create an Access Connector for Azure Databricks and assign it permissions to the storage container that is being used for your Unity Catalog metastore root storage, using the instructions in Configure a managed identity for Unity Catalog.

    You can create the access connector with either a system-assigned managed identity or a user-assigned managed identity.

    Make a note of the access connector’s resource ID. If you use a user-assigned managed identity, also make a note of its resource ID.

  2. As an account admin, log in to an Azure Databricks workspace that is assigned to the metastore.

    You do not have to be a workspace admin.

  3. Generate a personal access token.

  4. Create an Azure Databricks authentication configuration profile in your local environment that contains the following:

    • The workspace instance name and workspace ID of the workspace where you generated your personal access token.
    • The personal access token value.

    See Azure Databricks personal access token authentication.

  5. Use the Databricks CLI to run the following command to recreate the storage credential.

    Replace the placeholder values:

    • <credential-name>: A name for the storage credential.
    • <access-connector-id>: Resource ID for the Azure Databricks access connector in the format /subscriptions/12f34567-8ace-9c10-111c-aea8eba12345c/resourceGroups/<resource-group>/providers/Microsoft.Databricks/accessConnectors/<connector-name>
    • <managed-identity-id>: If you created the access connector using a user-assigned managed identity, specify the resource ID of the managed identity.
    • <profile-name>: The name of your Azure Databricks authentication configuration profile.
    databricks storage-credentials create --json '{
      "name": "<credential-name>",
      "azure_managed_identity": {
        "access_connector_id": "<access-connector-id>",
        "managed_identity_id": "<managed-identity-id>"
      }
    }' --profile <profile-name>
    
  6. Make a note of the storage credential ID in the response.

  7. Run the following Databricks CLI command to retrieve the metastore_id. Replace <profile-name> with the name of your Azure Databricks authentication configuration profile.

    databricks metastores summary --profile <profile-name>
    
  8. Run the following Databricks CLI command to update the metastore with the new root storage credential.

    Replace the placeholder values:

    • <metastore-id>: The metastore ID that you retrieved in the previous step.
    • <storage-credential-id>: The storage credential ID.
    • <profile-name>: The name of your Azure Databricks authentication configuration profile.
    databricks metastores update <metastore-id> \
    --storage-root-credential-id <storage-credential-id> \
    --profile <profile-name>
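
Steps 5 through 8 can be chained so that the IDs are passed along automatically. This is a sketch, assuming `jq` is installed, that the CLI returns JSON with an `id` field for the new storage credential and a `metastore_id` field in the metastore summary, and that the access connector uses a system-assigned managed identity (so `managed_identity_id` is omitted). The function name `upgrade_metastore_credential` is hypothetical.

```shell
# Create the storage credential, look up the metastore ID, and point the
# metastore's root storage at the new credential.
upgrade_metastore_credential() {
  credential_name="$1"
  access_connector_id="$2"   # resource ID of the access connector
  profile="$3"               # authentication configuration profile

  # Step 5 and 6: create the credential and capture its ID.
  credential_id="$(databricks storage-credentials create --json "{
    \"name\": \"$credential_name\",
    \"azure_managed_identity\": {
      \"access_connector_id\": \"$access_connector_id\"
    }
  }" --profile "$profile" --output json | jq -r '.id')"

  # Step 7: retrieve the metastore ID.
  metastore_id="$(databricks metastores summary \
    --profile "$profile" --output json | jq -r '.metastore_id')"

  # Step 8: update the metastore with the new root storage credential.
  databricks metastores update "$metastore_id" \
    --storage-root-credential-id "$credential_id" \
    --profile "$profile"
}

# Usage (placeholder values):
# upgrade_metastore_credential <credential-name> <access-connector-id> <profile-name>
```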