Κοινή χρήση μέσω


What are Databricks Asset Bundles?

Databricks Asset Bundles (DABs) are a tool to facilitate the adoption of software engineering best practices, including source control, code review, testing, and continuous integration and delivery (CI/CD), for your data and AI projects. Bundles make it possible to describe Databricks resources such as jobs, pipelines, and notebooks as source files. These source files provide an end-to-end definition of a project, including how it should be structured, tested, and deployed, which makes it easier to collaborate on projects during active development.

Bundles provide a way to include metadata alongside your project’s source files. When you deploy a project using bundles, this metadata is used to provision infrastructure and other resources. Your project’s collection of source files and metadata is then deployed as a single bundle to your target environment. A bundle includes the following parts:

  • Required cloud infrastructure and workspace configurations
  • Source files, such as notebooks and Python files, that include the business logic
  • Definitions and settings for Databricks resources, such as Azure Databricks jobs, Delta Live Tables pipelines, Model Serving endpoints, MLflow Experiments, and MLflow registered models
  • Unit tests and integration tests

The following diagram provides a high-level view of a development and CI/CD pipeline with bundles:

Databricks Asset Bundles overview

When should I use Databricks Asset Bundles?

Databricks Assets Bundles are an infrastructure-as-code (IaC) approach to managing your Databricks projects. Use them when you want to manage complex projects where multiple contributors and automation are essential, and continuous integration and deployment (CI/CD) are a requirement. Since bundles are defined and managed through YAML templates and files you create and maintain alongside source code, they map well to scenarios where IaC is an appropriate approach.

Some ideal scenarios for bundles include:

  • Develop data, analytics, and ML projects in a team-based environment. Bundles can help you organize and manage various source files efficiently. This ensures smooth collaboration and streamlined processes.
  • Iterate on ML problems faster. Manage ML pipeline resources (such as training and batch inference jobs) by using ML projects that follow production best practices from the beginning.
  • Set organizational standards for new projects by authoring custom bundle templates that include default permissions, service principals, and CI/CD configurations.
  • Regulatory compliance: In industries where regulatory compliance is a significant concern, bundles can help maintain a versioned history of code and infrastructure work. This assists in governance and ensures that necessary compliance standards are met.

How do Databricks Asset Bundles work?

Bundle metadata is defined using YAML files that specify the artifacts, resources, and configuration of a Databricks project. You can create this YAML file manually or generate one using a bundle template. The Databricks CLI can then be used to validate, deploy, and run bundles using these bundle YAML files. You can run bundle projects from IDEs, terminals, or within Databricks directly. This article uses the Databricks CLI.

Bundles can be created manually or based on a template. The Databricks CLI provides default templates for simple use cases, but for more specific or complex jobs, you can create custom bundle templates to implement your team’s best practices and keep common configurations consistent.

For more details on the configuration YAML used to express Databricks Asset Bundles, see Databricks Asset Bundle configuration.

Configure your environment to use bundles

Use the Databricks CLI to easily deploy bundles from the command line. To install the Databricks CLI, see Install or update the Databricks CLI.

Databricks Asset Bundles are available in Databricks CLI version 0.218.0 or above. To find the version of the Databricks CLI that is installed, run the following command:

databricks --version

After installing the Databricks CLI, verify that your remote Databricks workspaces are configured correctly. Bundles require the workspace files feature to be enabled as this feature supports working with files other than Databricks Notebooks, such as .py and .yml files. If you’re using Databricks Runtime version 11.3 LTS or above, this feature is enabled by default.

Authentication

Azure Databricks provides several authentication methods:

  • For attended authentication scenarios, such as manual workflows where you use your web browser to log in to your target Azure Databricks workspace (when prompted by the Databricks CLI), use OAuth user-to-machine (U2M) authentication. This method is ideal for experimenting with the getting started tutorials for Databricks Asset Bundles or for the rapid development of bundles.
  • For unattended authentication scenarios, such as fully automated workflows in which there is no opportunity for you to use your web browser to log in to your target Azure Databricks workspace at that time, use OAuth machine-to-machine (M2M) authentication. This method requires the use of Azure Databricks service principals and is ideal for using Databricks Asset Bundles with CI/CD systems such as GitHub.

For OAuth U2M authentication, do the following:

  1. Use the Databricks CLI to initiate OAuth token management locally by running the following command for each target workspace.

    In the following command, replace <workspace-url> with your Azure Databricks per-workspace URL, for example https://adb-1234567890123456.7.azuredatabricks.net.

    databricks auth login --host <workspace-url>
    
  2. The Databricks CLI prompts you to save the information that you entered as an Azure Databricks configuration profile. Press Enter to accept the suggested profile name, or enter the name of a new or existing profile. Any existing profile with the same name is overwritten with the information that you entered. You can use profiles to quickly switch your authentication context across multiple workspaces.

    To get a list of any existing profiles, in a separate terminal or command prompt, use the Databricks CLI to run the command databricks auth profiles. To view a specific profile’s existing settings, run the command databricks auth env --profile <profile-name>.

  3. In your web browser, complete the on-screen instructions to log in to your Azure Databricks workspace.

  4. To view a profile’s current OAuth token value and the token’s upcoming expiration timestamp, run one of the following commands:

    • databricks auth token --host <workspace-url>
    • databricks auth token -p <profile-name>
    • databricks auth token --host <workspace-url> -p <profile-name>

    If you have multiple profiles with the same --host value, you might need to specify the --host and -p options together to help the Databricks CLI find the correct matching OAuth token information.

You can use this configuration profile’s name in one or more of the following ways whenever you validate, deploy, run, or destroy bundles:

  • With the command-line option -p <profile-name>, appended to the commands databricks bundle validate, databricks bundle deploy, databricks bundle run, or databricks bundle destroy. See Databricks Asset Bundles development.
  • As the value of the profile mapping in the bundle configuration file’s top-level workspace mapping (although Databricks recommends that you use the host mapping set to the Azure Databricks workspace’s URL instead of the profile mapping, as it makes your bundle configuration files more portable). See coverage of the profile mapping in workspace.
  • If the configuration profile’s name is DEFAULT, it is used by default when the command-line option -p <profile-name> or the profile (or host) mapping is not specified.

For OAuth M2M authentication, do the following:

  1. Complete the OAuth M2M authentication setup instructions. See Authenticate access to Azure Databricks with a service principal using OAuth (OAuth M2M).

  2. Install the Databricks CLI on the target compute resource in one of the following ways:

    • To manually install the Databricks CLI on the compute resource in real time, see Install or update the Databricks CLI.
    • To use GitHub Actions to automatically install the Databricks CLI on a GitHub virtual machine, see setup-cli in GitHub.
    • To use other CI/CD systems to automatically install the Databricks CLI on a virtual machine, see see your CI/CD system provider’s documentation and Install or update the Databricks CLI.
  3. Set the following environment variables on the compute resource as follows:

    • DATABRICKS_HOST, set to the Azure Databricks per-workspace URL, for example https://adb-1234567890123456.7.azuredatabricks.net.
    • DATABRICKS_CLIENT_ID, set to the Azure Databricks service principal’s Application ID value.
    • DATABRICKS_CLIENT_SECRET, set to the Azure Databricks service principal’s OAuth Secret value.

    To set these environment variables, see the documentation for your target compute resource’s operating system or CI/CD system.

Develop your first Databricks Asset Bundle

The fastest way to start bundle development is by using a bundle project template. Create your first bundle project using the Databricks CLI bundle init command. This command presents a choice of Databricks-provided default bundle templates and asks a series of questions to initialize project variables.

databricks bundle init

Creating your bundle is the first step in the lifecycle of a bundle. The second step is developing your bundle, a key element of which is defining bundle settings and resources in the databricks.yml and resource configuration files. For information about bundle configuration, see Databricks Asset Bundle configuration.

Tip

Bundle configuration examples can be found in Bundle configuration examples and the Bundle examples repository in GitHub.

Next steps