Libraries
To make third-party or custom code available to notebooks and jobs running on your clusters, you can install a library. Libraries can be written in Python, Java, Scala, and R. You can upload Python, Java, and Scala libraries and point to external packages in PyPI, Maven, and CRAN repositories.
Azure Databricks includes many common libraries in Databricks Runtime. To see which libraries are included in Databricks Runtime, look at the System Environment subsection of the Databricks Runtime release notes for your Databricks Runtime version.
Note
Microsoft Support helps isolate and resolve issues related to libraries installed and maintained by Azure Databricks. For third-party components, including libraries, Microsoft provides commercially reasonable support to help you further troubleshoot issues. Microsoft Support assists on a best-effort basis and might be able to resolve the issue. For open source connectors and projects hosted on GitHub, we recommend that you file issues on GitHub and follow up on them. Development efforts such as shading JARs or building Python libraries are not supported through the standard support case submission process; they require a consulting engagement for faster resolution. Support might ask you to engage other channels for open source technologies where you can find deep expertise for that technology. There are several community sites; two examples are the Microsoft Q&A page for Azure Databricks and Stack Overflow.
Cluster-scoped libraries
You can install libraries on clusters so that they can be used by all notebooks and jobs running on the cluster. Databricks supports Python, JAR, and R libraries. See Cluster libraries.
You can install a cluster library directly from the following sources:
- A package repository such as PyPI, Maven, or CRAN
- Workspace files
- Unity Catalog volumes
- A cloud object storage location
- A path on your local machine
Not all locations are supported for all types of libraries or all compute configurations. See Recommendations for uploading libraries for configuration recommendations.
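As an illustration, the following sketch installs a PyPI package and a wheel file from a Unity Catalog volume onto a running cluster through the Libraries API. The host, token, cluster ID, package version, and volume path are placeholder values; substitute your own workspace details.

```python
import os
import requests

# Sketch: install cluster libraries with the Libraries API
# (POST /api/2.0/libraries/install). All values below are placeholders.
host = os.environ["DATABRICKS_HOST"]    # for example, https://adb-1234567890123456.7.azuredatabricks.net
token = os.environ["DATABRICKS_TOKEN"]  # personal access token or Microsoft Entra ID token

payload = {
    "cluster_id": "1234-567890-abcde123",  # placeholder cluster ID
    "libraries": [
        {"pypi": {"package": "simplejson==3.19.2"}},                        # package repository source
        {"whl": "/Volumes/main/default/libs/my_lib-0.1-py3-none-any.whl"},  # hypothetical volume path
    ],
}

response = requests.post(
    f"{host}/api/2.0/libraries/install",
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
)
response.raise_for_status()
```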
Important
Libraries can be installed from DBFS when using Databricks Runtime 14.3 LTS and below. However, any workspace user can modify library files stored in DBFS. To improve the security of libraries in an Azure Databricks workspace, storing library files in the DBFS root is deprecated and disabled by default in Databricks Runtime 15.1 and above. See Storing libraries in DBFS root is deprecated and disabled by default.
Instead, Databricks recommends uploading all libraries, including Python libraries, JAR files, and Spark connectors, to workspace files or Unity Catalog volumes, or using library package repositories. If your workload does not support these patterns, you can also use libraries stored in cloud object storage.
For complete library support information, see Python library support, Java and Scala library support, and R library support.
Recommendations for uploading libraries
Databricks supports most configurations for installing Python, JAR, and R libraries, but there are some unsupported scenarios. Databricks recommends uploading libraries to source locations that support installation onto compute with shared access mode, as this is the recommended mode for all workloads. See Access modes. When scheduling jobs with shared access mode, run the job as a service principal.
Important
Only use compute with single user access mode if required functionality is not supported by shared access mode. No isolation shared access mode is a legacy configuration on Databricks that is not recommended.
The following table provides recommendations organized by Databricks Runtime version and Unity Catalog enablement.
| Configuration | Recommendation |
| --- | --- |
| Databricks Runtime 13.3 LTS and above with Unity Catalog | Install libraries on compute with shared access mode from Unity Catalog volumes with GRANT READ for all account users. If applicable, Maven coordinates and JAR library paths need to be added to the allowlist. |
| Databricks Runtime 11.3 LTS and above without Unity Catalog | Install libraries from workspace files. (File size limit is 500 MB.) |
| Databricks Runtime 10.4 LTS and below | Install libraries from cloud object storage. |
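For the Unity Catalog recommendation above, granting read access on the volume that holds your library files might look like the following sketch, run from a notebook attached to Unity Catalog-enabled compute. The catalog, schema, and volume names are placeholders.

```python
# Sketch: grant read access on a hypothetical Unity Catalog volume that stores
# library files, so that all account users can install libraries from it.
spark.sql("GRANT READ VOLUME ON VOLUME main.default.libs TO `account users`")
```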
Python library support
The following table indicates Databricks Runtime version compatibility for Python wheel files for different cluster access modes based on the library source location. See Databricks Runtime release notes versions and compatibility and Access modes.
In Databricks Runtime 15.0 and above, you can use requirements.txt files to manage your Python dependencies. These files can be uploaded to any supported source location.
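For example, a notebook cell on Databricks Runtime 15.0 or above can install dependencies from a requirements.txt file stored in a volume; the path below is a hypothetical example.

```python
# Sketch: install notebook-scoped dependencies from a requirements.txt file
# stored in a hypothetical Unity Catalog volume path.
%pip install -r /Volumes/main/default/libs/requirements.txt
```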
Note
Installing Python egg files is only supported on Databricks Runtime 13.3 LTS and below, and only for single user or no isolation shared access modes. In addition, you cannot install Python egg files on volumes or workspace files. Use Python wheel files or install packages from PyPI instead.
| Library source | Shared access mode | Single user access mode | No isolation shared access mode (Legacy) |
| --- | --- | --- | --- |
| PyPI | 13.3 LTS and above | All supported Databricks Runtime versions | All supported Databricks Runtime versions |
| Workspace files | 13.3 LTS and above | 13.3 LTS and above | 14.1 and above |
| Volumes | 13.3 LTS and above | 13.3 LTS and above | Not supported |
| Cloud storage | 13.3 LTS and above | All supported Databricks Runtime versions | All supported Databricks Runtime versions |
| DBFS (Not recommended) | Not supported | 14.3 and below | 14.3 and below |
Java and Scala library support
The following table indicates Databricks Runtime version compatibility for JAR files for different cluster access modes based on the library source location. See Databricks Runtime release notes versions and compatibility and Access modes.
Note
Shared access mode requires an admin to add Maven coordinates and paths for JAR libraries to an allowlist. See Allowlist libraries and init scripts on shared compute.
| Library source | Shared access mode | Single user access mode | No isolation shared access mode (Legacy) |
| --- | --- | --- | --- |
| Maven | 13.3 LTS and above | All supported Databricks Runtime versions | All supported Databricks Runtime versions |
| Workspace files | Not supported | Not supported | 14.1 and above |
| Volumes | 13.3 LTS and above | 13.3 LTS and above | Not supported |
| Cloud storage | 13.3 LTS and above | All supported Databricks Runtime versions | All supported Databricks Runtime versions |
| DBFS (Not recommended) | Not supported | 14.3 and below | 14.3 and below |
R library support
The following table indicates Databricks Runtime version compatibility for CRAN packages for different cluster access modes. See Databricks Runtime release notes versions and compatibility and Access modes.
| Library source | Shared access mode | Single user access mode | No isolation shared access mode (Legacy) |
| --- | --- | --- | --- |
| CRAN | Not supported | All supported Databricks Runtime versions | All supported Databricks Runtime versions |
Notebook-scoped libraries
Notebook-scoped libraries, available for Python and R, allow you to install libraries and create an environment scoped to a notebook session. These libraries do not affect other notebooks running on the same cluster. Notebook-scoped libraries do not persist and must be re-installed for each session. Use notebook-scoped libraries when you need a custom environment for a specific notebook.
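A minimal sketch of a notebook-scoped installation follows; the package and version are arbitrary examples.

```python
# Sketch (run as two separate notebook cells).

# Cell 1: install a library scoped to this notebook session only.
%pip install requests==2.31.0

# Cell 2: restart the Python process so the newly installed version is importable.
dbutils.library.restartPython()
```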
Note
JARs cannot be installed at the notebook level.
Important
Workspace libraries have been deprecated and should not be used. See Workspace libraries (legacy). However, storing libraries as workspace files is distinct from workspace libraries and is still fully supported. You can install libraries stored as workspace files directly to compute or job tasks.
Python environment management
The following table provides an overview of options you can use to install Python libraries in Azure Databricks.
Note
- Custom containers that use a conda-based environment are not compatible with notebook-scoped libraries and with cluster libraries in Databricks Runtime 10.4 LTS and above. Instead, Azure Databricks recommends installing libraries directly in the image or using init scripts. To continue using cluster libraries in those scenarios, you can set the Spark configuration `spark.databricks.driverNfs.clusterWidePythonLibsEnabled` to `false`. Support for the Spark configuration will be removed on or after December 31, 2021.
| Python package source | Notebook-scoped libraries with %pip | Notebook-scoped libraries with base environment YAML file | Cluster libraries | Job libraries with Jobs API |
| --- | --- | --- | --- | --- |
| PyPI | Use `%pip install`. See example. | Add a PyPI package name to a base environment YAML file. See example. | Select PyPI as the source. | Add a new `pypi` object to the job libraries and specify the `package` field. |
| Private PyPI mirror, such as Nexus or Artifactory | Use `%pip install` with the `--index-url` option. Secret management is available. See example. | Add the `--index-url` to a base environment YAML file. Secret management is available. See example. | Not supported. | Not supported. |
| VCS, such as GitHub, with raw source | Use `%pip install` and specify the repository URL as the package name. See example. | Add a repository URL as a package name to a base environment YAML file. See example. | Select PyPI as the source and specify the repository URL as the package name. | Add a new `pypi` object to the job libraries and specify the repository URL as the `package` field. |
| Private VCS with raw source | Use `%pip install` and specify the repository URL with basic authentication as the package name. Secret management is available. See example. | Add a repository with basic authentication as the package name to a base environment YAML file. See example. | Not supported. | Not supported. |
| File path | Use `%pip install`. See example. | Add a file path as a package name to a base environment YAML file. See example. | Select File path/ADLS as the source. | Add a new `egg` or `whl` object to the job libraries and specify the file path as the `package` field. |
| Azure Data Lake Storage Gen2 | Use `%pip install` together with a pre-signed URL. Paths with the Azure Data Lake Storage Gen2 protocol `abfss://` are not supported. | Add a pre-signed URL as a package name to a base environment YAML file. Paths with the Azure Data Lake Storage Gen2 protocol `abfss://` are not supported. | Select File path/ADLS as the source. | Add a new `egg` or `whl` object to the job libraries and specify the Azure Data Lake Storage Gen2 path as the `package` field. |
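To illustrate the %pip column above, a few notebook-cell sketches follow. Every package name, index URL, repository, and file path is a placeholder; substitute your own values.

```python
# Notebook-cell sketches of the %pip patterns above. All values are placeholders.

# PyPI
%pip install scikit-learn==1.4.2

# Private PyPI mirror (hypothetical index URL)
%pip install my-internal-package --index-url https://nexus.example.com/repository/pypi/simple

# VCS with raw source (hypothetical public repository)
%pip install git+https://github.com/example-org/example-repo.git

# File path (hypothetical wheel stored in a Unity Catalog volume)
%pip install /Volumes/main/default/libs/my_lib-0.1-py3-none-any.whl
```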
Python library precedence
You might encounter a situation where you need to override the version of a built-in library, or have a custom library whose name conflicts with another library installed on the cluster. When you run `import <library>`, the library with the highest precedence is imported.
Important
Libraries stored in workspace files have different precedence depending on how they are added to the Python `sys.path`. A Databricks Git folder adds the current working directory to the path before all other libraries, while notebooks outside Git folders add the current working directory after other libraries are installed. If you manually append workspace directories to your path, these always have the lowest precedence.
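For example, a manually appended workspace directory resolves only after every other location; the directory path below is a placeholder.

```python
import sys

# Sketch: manually append a hypothetical workspace directory to sys.path.
# Modules found here resolve only if no higher-precedence location provides them.
sys.path.append("/Workspace/Users/someone@example.com/shared_utils")
```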
The following list orders precedence from highest to lowest. In this list, a lower number means higher precedence.
1. Libraries in the current working directory (Git folders only).
2. Libraries in the Git folder root directory (Git folders only).
3. Notebook-scoped libraries (`%pip install` in notebooks).
4. Cluster libraries (using the UI, CLI, or API).
5. Libraries included in Databricks Runtime.
   - Libraries installed with init scripts might resolve before or after built-in libraries, depending on how they are installed. Databricks does not recommend installing libraries with init scripts.
6. Libraries in the current working directory (not in Git folders).
7. Workspace files appended to `sys.path`.
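To confirm which copy of a library won under these rules, you can inspect the imported module's file location and version. The module name below is hypothetical; substitute the library you are debugging.

```python
# Sketch: check where an imported module was resolved from.
# "my_library" is a hypothetical name.
import my_library

print(my_library.__file__)  # the path shows whether it came from a Git folder,
                            # a notebook-scoped install, a cluster library, or the runtime
print(getattr(my_library, "__version__", "unknown version"))
```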