Install libraries from object storage

08/29/2024

This article walks you through the steps required to install libraries from cloud object storage on Azure Databricks.

Note

This article refers to cloud object storage as a general concept, and assumes that you are directly interacting with data stored in object storage using URIs. Databricks recommends using Unity Catalog volumes to configure access to files in cloud object storage. See What are Unity Catalog volumes?.

You can store custom JAR and Python wheel (.whl) libraries in cloud object storage instead of storing them in the DBFS root. See Cluster-scoped libraries for full library compatibility details.

Important

Libraries can be installed from DBFS when using Databricks Runtime 14.3 LTS and below. However, any workspace user can modify library files stored in DBFS. To improve the security of libraries in an Azure Databricks workspace, storing library files in the DBFS root is deprecated and disabled by default in Databricks Runtime 15.1 and above. See Storing libraries in DBFS root is deprecated and disabled by default.

Instead, Databricks recommends uploading all libraries, including Python libraries, JAR files, and Spark connectors, to workspace files or Unity Catalog volumes, or using library package repositories. If your workload does not support these patterns, you can also use libraries stored in cloud object storage.

Load libraries to object storage

You can load libraries to object storage the same way you load other files. You must have proper permissions in your cloud provider to create new object storage containers or load files into cloud object storage.

Grant read-only permissions to object storage

Databricks recommends configuring all privileges related to library installation with read-only permissions.

Azure Databricks allows you to assign security permissions to individual clusters that govern access to data in cloud object storage. These policies can be expanded to add read-only access to cloud object storage that contains libraries.

Note

In Databricks Runtime 12.2 LTS and below, you cannot load JAR libraries when using clusters with shared access modes. In Databricks Runtime 13.3 LTS and above, you must add JAR libraries to the Unity Catalog allowlist. See Allowlist libraries and init scripts on shared compute.

Databricks recommends using Microsoft Entra ID service principals to manage access to libraries stored in Azure Data Lake Storage Gen2. Use the following linked documentation to complete this setup:

Create a service principal with read and list permissions on your desired blobs. See Access storage using a service principal & Microsoft Entra ID (Azure Active Directory).
Save your credentials using secrets. See Manage secrets.

Set the properties in the Spark configuration and environment variables while creating a cluster, as in the following example:

Spark config:

spark.hadoop.fs.azure.account.auth.type.<storage-account>.dfs.core.windows.net OAuth
spark.hadoop.fs.azure.account.oauth.provider.type.<storage-account>.dfs.core.windows.net org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
spark.hadoop.fs.azure.account.oauth2.client.id.<storage-account>.dfs.core.windows.net <application-id>
spark.hadoop.fs.azure.account.oauth2.client.secret.<storage-account>.dfs.core.windows.net {{secrets/<secret-scope>/<service-credential-key>}}
spark.hadoop.fs.azure.account.oauth2.client.endpoint.<storage-account>.dfs.core.windows.net https://login.microsoftonline.com/<tenant-id>/oauth2/token

Environment variables:

SERVICE_CREDENTIAL={{secrets/<secret-scope>/<service-credential-key>}}
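When the same settings are applied to several clusters, it can help to generate the Spark config entries instead of editing the placeholders by hand. The following is a minimal sketch, assuming you fill in your own storage account, application ID, tenant ID, secret scope, and secret key. The function names adls_oauth_spark_conf and to_spark_conf_text are hypothetical helpers, not part of any Databricks library.

```python
def adls_oauth_spark_conf(storage_account, application_id, tenant_id,
                          secret_scope, secret_key):
    """Build the five Spark config entries from the example above.

    Hypothetical helper: it only fills in the placeholders and does not
    contact Azure or Databricks.
    """
    host = f"{storage_account}.dfs.core.windows.net"
    prefix = "spark.hadoop.fs.azure.account"
    # Secret reference syntax as used in the example: {{secrets/<scope>/<key>}}
    secret_ref = "{{secrets/" + secret_scope + "/" + secret_key + "}}"
    return {
        f"{prefix}.auth.type.{host}": "OAuth",
        f"{prefix}.oauth.provider.type.{host}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"{prefix}.oauth2.client.id.{host}": application_id,
        f"{prefix}.oauth2.client.secret.{host}": secret_ref,
        f"{prefix}.oauth2.client.endpoint.{host}":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


def to_spark_conf_text(conf):
    """Render the entries as 'key value' lines, one entry per line."""
    return "\n".join(f"{key} {value}" for key, value in conf.items())
```

The rendered text can be pasted into the Spark config field of the cluster configuration. The client secret stays a secret reference, so the credential itself never appears in the generated text.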

(Optional) Refactor init scripts using azcopy or the Azure CLI.

You can reference environment variables set during cluster configuration within your init scripts to pass credentials stored as secrets for validation.

Install libraries to clusters

To install a library stored in cloud object storage to a cluster, complete the following steps:

Select a cluster from the list in the clusters UI.
Select the Libraries tab.
Select the File path/ADLS option.
Provide the full URI path to the library object (for example, abfss://container-name@storage-account-name.dfs.core.windows.net/path/to/library.whl).
Click Install.
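The URI in step 4 follows the pattern abfss://<container>@<storage-account>.dfs.core.windows.net/<path>. The following is a minimal sketch of a helper that assembles it and rejects file types other than the JAR and wheel libraries this article covers. The function name abfss_library_uri is a hypothetical helper, and the suffix check is an assumption made for illustration.

```python
# File types this article describes storing in object storage.
SUPPORTED_SUFFIXES = (".whl", ".jar")


def abfss_library_uri(container, storage_account, path):
    """Return the full abfss:// URI for a library file.

    Hypothetical helper: it only formats the string and does not check
    that the object exists.
    """
    if not path.endswith(SUPPORTED_SUFFIXES):
        raise ValueError(f"expected a .whl or .jar file, got: {path}")
    clean_path = path.lstrip("/")  # avoid a double slash after the host name
    return f"abfss://{container}@{storage_account}.dfs.core.windows.net/{clean_path}"
```

For example, abfss_library_uri("container-name", "storage-account-name", "path/to/library.whl") produces the URI shown in step 4.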

You can also install libraries using the REST API or CLI.
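As an illustration of the REST API route, the following sketch builds (but does not send) an install request using only the Python standard library. It assumes the Libraries API endpoint /api/2.0/libraries/install and a personal access token passed as a bearer token. The workspace URL, token, and cluster ID are placeholders you supply, and build_install_request is a hypothetical helper name. Confirm the endpoint and payload against the REST API reference for your workspace before relying on it.

```python
import json
import urllib.request


def build_install_request(workspace_url, token, cluster_id, library_uri):
    """Build a POST request that asks a cluster to install one library.

    Hypothetical helper. Assumes the Libraries API accepts a payload of
    the form {"cluster_id": ..., "libraries": [{"whl" or "jar": <uri>}]}.
    """
    library_key = "jar" if library_uri.endswith(".jar") else "whl"
    payload = {
        "cluster_id": cluster_id,
        "libraries": [{library_key: library_uri}],
    }
    return urllib.request.Request(
        url=workspace_url.rstrip("/") + "/api/2.0/libraries/install",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

To send the request, pass the returned object to urllib.request.urlopen. Keeping construction separate from sending makes it easy to inspect the payload first.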

Install libraries to notebooks

You can use %pip to install custom Python wheel files stored in object storage. Libraries installed this way are scoped to a notebook-isolated SparkSession. To use this method, you must either store libraries in publicly readable object storage or use a pre-signed URL.

See Notebook-scoped Python libraries.
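As a rough sketch of the pre-signed URL approach, the helper below composes the %pip line to paste into a notebook cell. It assumes the pre-signed URL takes the form of an HTTPS URL to the wheel file with a signature token appended as a query string. The function name pip_install_line and the token value are hypothetical, and quoting the URL is an assumption intended to keep any & characters in the token intact.

```python
def pip_install_line(wheel_url, signature_token):
    """Return a '%pip install' line for a wheel behind a pre-signed URL.

    Hypothetical helper: wheel_url is the HTTPS URL of the .whl object and
    signature_token is the query string that grants temporary read access.
    """
    if not wheel_url.endswith(".whl"):
        raise ValueError(f"expected a URL ending in .whl, got: {wheel_url}")
    query = signature_token.lstrip("?")  # accept tokens with or without '?'
    # Quotes are an assumption: they keep '&' in the token from splitting
    # the command.
    return f'%pip install "{wheel_url}?{query}"'
```

Treat the resulting line like a credential, because anyone who can read it can download the library until the token expires.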

Note

JAR libraries cannot be installed in the notebook. You must install JAR libraries at the cluster level.
