Work with Python and R modules

This article describes how you can use relative paths to import custom Python and R modules stored in workspace files alongside your Databricks notebooks. Workspace files can facilitate tighter development lifecycles, allowing you to modularize your code, convert %run commands to import statements, and refactor Python wheel files to co-versioned modules. You can also use the built-in Databricks web terminal to test your code.

Note

In Databricks Runtime 14.0 and above, the default current working directory (CWD) for code executed locally is the directory containing the notebook or script being run. This is a change in behavior from Databricks Runtime 13.3 LTS and below. See What is the default current working directory?.

Import Python and R modules

Important

In Databricks Runtime 13.3 LTS and above, directories added to the Python sys.path, or directories that are structured as Python packages, are automatically distributed to all executors in the cluster. In Databricks Runtime 12.2 LTS and below, libraries added to the sys.path must be explicitly installed on executors.

In Databricks Runtime 11.3 LTS and above, the current working directory of your notebook is automatically added to the Python path. If you’re using Git folders, the root repo directory is added.
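To check how these defaults apply in a given notebook, it can help to inspect the working directory and the Python path directly. The following is a minimal sketch using only the standard library; the helper name describe_import_environment is hypothetical and not part of any Databricks API.

```python
import os
import sys


def describe_import_environment():
    """Return the current working directory and whether it is on sys.path."""
    cwd = os.getcwd()
    # An empty string on sys.path also stands for the current directory,
    # so treat it as equivalent to the CWD when normalizing.
    normalized = {os.path.abspath(p) if p else cwd for p in sys.path}
    return {"cwd": cwd, "cwd_on_sys_path": os.path.abspath(cwd) in normalized}


info = describe_import_environment()
print(info["cwd"])
print(info["cwd_on_sys_path"])
```

If the second value is False, modules stored next to the notebook are unlikely to be importable without adding their directory to sys.path first.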

To import modules from another directory, you must add the directory containing the module to sys.path. You can specify directories using a relative path, as in the following example:

import sys
import os
sys.path.append(os.path.abspath('..'))
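Because notebook cells are often re-run, the same directory can end up on sys.path several times. One way to avoid that is a small guard, sketched below. The function name add_to_sys_path is an assumption made for illustration; the relative path is resolved against the current working directory, so the result depends on the runtime behavior described in the note at the top of this article.

```python
import os
import sys


def add_to_sys_path(directory):
    """Append the absolute form of `directory` to sys.path if it is missing."""
    absolute = os.path.abspath(directory)
    if absolute not in sys.path:
        sys.path.append(absolute)
    return absolute


# Add the parent of the current working directory, as in the example above.
parent_dir = add_to_sys_path("..")
print(parent_dir)
```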

You import functions from a module stored in workspace files just as you would from a module saved as a cluster library or notebook-scoped library:

Python

from sample import power
power.powerOfTwo(3)

R

source("sample.R")
power.powerOfTwo(3)
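One layout consistent with the Python example above is a directory named sample that contains a file power.py. The sketch below builds that layout in a temporary directory, which stands in for the notebook's folder, and then imports it. The body of powerOfTwo is an assumption (it is taken here to return 2 raised to n); the source does not show the function.

```python
import importlib
import os
import sys
import tempfile

# A throwaway directory that stands in for the notebook's folder.
workspace_dir = tempfile.mkdtemp()
package_dir = os.path.join(workspace_dir, "sample")
os.makedirs(package_dir)

# sample/__init__.py marks the directory as a package (left empty here).
with open(os.path.join(package_dir, "__init__.py"), "w") as f:
    f.write("")

# sample/power.py: the function body is an assumption for illustration only.
with open(os.path.join(package_dir, "power.py"), "w") as f:
    f.write("def powerOfTwo(n):\n    return 2 ** n\n")

# Insert at the front so this demo directory wins over any other "sample" module.
sys.path.insert(0, workspace_dir)
importlib.invalidate_caches()

from sample import power

result = power.powerOfTwo(3)
print(result)
```

In a real workspace the files already exist next to the notebook, so only the final import and call would be needed.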

Important

When you use an import statement and multiple libraries of the same name exist, Databricks uses precedence rules to determine which library to load. See Python library precedence.

Autoreload for Python modules

If you are editing multiple files while developing Python code, you can enable the autoreload extension to reload any imported modules automatically so that command runs pick up those edits. Use the following commands in any notebook cell or Python file to enable the autoreload extension:

%load_ext autoreload
%autoreload 2

The autoreload extension works only in the Spark driver process; it does not reload code into Spark executor processes. For that reason, do not rely on autoreload when developing modules that run on worker nodes (for example, UDFs).
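As a rough illustration of what autoreload automates, the sketch below reloads an edited module by hand with importlib.reload from the standard library. This is not how the extension is implemented; the module name greeting_demo and its contents are hypothetical, and the temporary directory stands in for a workspace folder.

```python
import importlib
import os
import sys
import tempfile

module_dir = tempfile.mkdtemp()
module_path = os.path.join(module_dir, "greeting_demo.py")  # hypothetical module

# First version of the module.
with open(module_path, "w") as f:
    f.write("def greet():\n    return 'v1'\n")

sys.path.append(module_dir)
importlib.invalidate_caches()

import greeting_demo

before = greeting_demo.greet()

# Simulate an edit. The new source has a different size, so any cached
# bytecode for the old version is treated as stale on reload.
with open(module_path, "w") as f:
    f.write("def greet():\n    return 'version 2'\n")

# Without this call, the process keeps using the code loaded at first import.
importlib.reload(greeting_demo)
after = greeting_demo.greet()

print(before, "->", after)
```

Like autoreload, this manual reload only affects the Python process in which it runs.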

In Databricks Runtime 16.0 and above, the autoreload extension in Databricks adds the following features:

  • Support for targeted reloading of modules for modifications internal to functions. Reloading just the changed portion of a module whenever possible ensures that there is only one externally visible instance of each object, which is safer and more reliable.
  • When you import a Python module from a workspace file, Databricks automatically suggests using autoreload if the module has changed since its last import.

To learn more about the autoreload extension, see the IPython autoreload documentation.

Refactor code

A best practice for code development is to modularize code so it can be easily reused. You can create custom Python files with workspace files and make the code in those files available to a notebook using the import statement.

To refactor notebook code into reusable files:

  1. Create a new source code file for your code.
  2. Add Python import statements to the notebook to make the code in your new file available to the notebook.

Migrate from %run commands

If you are using %run commands to make Python or R functions defined in a notebook available to another notebook, or are installing custom .whl files on a cluster, consider including those custom modules as workspace files. This way, you can keep your notebooks and other code modules in sync, ensuring that your notebook always uses the correct version.

%run commands let you include one notebook in another and are often used to make supporting Python or R code available to a notebook. In this example, a notebook named power.py includes the code below.

# This code is in a notebook named "power.py".
def n_to_mth(n, m):
    print(n, "to the", m, "th power is", n**m)

You can then make functions defined in power.py available to a different notebook with a %run command:

# This notebook uses a %run command to access the code in "power.py".
%run ./power
n_to_mth(3, 4)

Using workspace files, you can directly import the module that contains the Python code and run the function.

from power import n_to_mth
n_to_mth(3, 4)

Refactor Python .whl files to relative libraries

You can install custom .whl files onto a cluster and then import them into a notebook attached to that cluster. However, this process might be cumbersome and error-prone for frequently updated code. Workspace files let you keep these Python files in the same directory as the notebooks that use the code, ensuring that your notebook always uses the correct version.

For more information about packaging Python projects, see this tutorial.

Use the Databricks web terminal for testing

You can use the Databricks web terminal to test modifications to your Python or R code without using a notebook to import and run the file.

  1. Open the web terminal.
  2. Change to the directory: cd /Workspace/Users/<path-to-directory>/.
  3. Run the Python or R file: python file_name.py or Rscript file_name.r.
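For the last step to produce useful output, the file needs something to execute when it is run directly. A common Python convention is a __main__ guard, sketched below for a hypothetical file_name.py. This variant of n_to_mth returns the value instead of printing it, which is an assumption made here so the result can be checked; the helper run_self_checks is also hypothetical.

```python
# file_name.py (hypothetical contents)


def n_to_mth(n, m):
    """Return n raised to the m-th power."""
    return n ** m


def run_self_checks():
    """Minimal assertions that act as a smoke test from the terminal."""
    assert n_to_mth(3, 4) == 81
    assert n_to_mth(2, 0) == 1
    return "all checks passed"


if __name__ == "__main__":
    # Runs only when the file is executed directly, not when it is imported.
    print(run_self_checks())
```

With this structure, the same file can be imported from a notebook without triggering the checks, and run from the terminal to exercise them.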