Databricks Asset Bundles development
This article describes the development and lifecycle of a Databricks Asset Bundle. For general information about Databricks Asset Bundles, see What are Databricks Asset Bundles?.
Lifecycle of a bundle
To understand how to effectively use bundles, you need to understand the basic lifecycle of a bundle:
- The bundle skeleton is created, typically based on a project template.
- The bundle project is developed locally. A bundle contains configuration files that define infrastructure and workspace settings, such as deployment targets and settings for Databricks resources like jobs and pipelines, along with source files and other artifacts.
- The bundle project is validated. Validation verifies the settings and resource definitions in the bundle configuration against the corresponding object schemas to ensure the bundle is deployable to Databricks.
- The bundle is deployed to a target workspace. Most commonly a bundle is first deployed to a user’s personal dev workspace for testing. Once testing of the bundle is finished, the bundle can be deployed to staging, then production targets.
- Workflow resources defined in the deployed bundle can be run. For example, you can run a job.
- If the bundle is no longer being used, it can be permanently destroyed.
You use the Databricks CLI bundle commands to create, validate, deploy, run, and destroy bundles, as described in the following sections.
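For example, a typical bundle lifecycle maps to the following command sequence (a sketch; the dev target and hello_job key are illustrative names from later in this article):
databricks bundle init                  # create a bundle from a template
databricks bundle validate              # verify the configuration is deployable
databricks bundle deploy -t dev         # deploy to the dev target
databricks bundle run -t dev hello_job  # run a job defined in the bundle
databricks bundle destroy               # permanently remove deployed resources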
Step 1: Create a bundle
There are three ways to begin creating a bundle:
- Use the default bundle template.
- Use a custom bundle template.
- Create a bundle manually.
Use a default bundle template
To use an Azure Databricks default bundle template to create a starter bundle that you can then customize further, use Databricks CLI version 0.218.0 or above to run the bundle init command, which allows you to choose from a list of available templates. See Create a bundle from a project template.
databricks bundle init
You can view the source for the default bundle templates in the databricks/cli and databricks/mlops-stacks GitHub public repositories.
Skip ahead to Step 2: Populate the bundle configuration files.
Use a custom bundle template
To use a bundle template other than the Azure Databricks default bundle template, you must know the local path or the URL to the remote bundle template location. Use Databricks CLI version 0.218.0 or above to run the bundle init command as follows:
databricks bundle init <project-template-local-path-or-url>
For more information about this command, see Databricks Asset Bundle project templates. For information about a specific bundle template, see the bundle template provider’s documentation.
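For example, to initialize a bundle from the databricks/mlops-stacks repository mentioned earlier, pass its public GitHub URL as the remote template location:
databricks bundle init https://github.com/databricks/mlops-stacks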
Skip ahead to Step 2: Populate the bundle configuration files.
Create a bundle manually
To create a bundle manually instead of by using a bundle template, create a project directory on your local machine, or an empty repository with a third-party Git provider.
In your directory or repository, create one or more bundle configuration files as input. These files are expressed in YAML format. There must be at minimum one (and only one) bundle configuration file named databricks.yml. Additional bundle configuration files must be referenced in the include mapping of the databricks.yml file.
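For example, a minimal databricks.yml might look like the following sketch. The bundle name and the included file path are hypothetical placeholders:
bundle:
  name: my_bundle          # hypothetical bundle name

include:
  - resources/*.yml        # additional configuration files referenced here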
To create YAML files that conform to the Databricks Asset Bundle configuration syntax more easily and quickly, you can use a tool such as Visual Studio Code, PyCharm Professional, or IntelliJ IDEA Ultimate that provides support for YAML files and JSON schema files, as follows:
Visual Studio Code
1. Add YAML language server support to Visual Studio Code, for example by installing the YAML extension from the Visual Studio Code Marketplace.
2. Generate the Databricks Asset Bundle configuration JSON schema file by using Databricks CLI version 0.218.0 or above to run the bundle schema command and redirect the output to a JSON file. For example, generate a file named bundle_config_schema.json within the current directory, as follows:
   databricks bundle schema > bundle_config_schema.json
3. Use Visual Studio Code to create or open a bundle configuration file within the current directory. This file must be named databricks.yml.
4. Add the following comment to the beginning of your bundle configuration file:
   # yaml-language-server: $schema=bundle_config_schema.json
   Note
   In the preceding comment, if your Databricks Asset Bundle configuration JSON schema file is in a different path, replace bundle_config_schema.json with the full path to your schema file.
5. Use the YAML language server features that you added earlier. For more information, see your YAML language server's documentation.
PyCharm Professional
1. Generate the Databricks Asset Bundle configuration JSON schema file by using Databricks CLI version 0.218.0 or above to run the bundle schema command and redirect the output to a JSON file. For example, generate a file named bundle_config_schema.json within the current directory, as follows:
   databricks bundle schema > bundle_config_schema.json
2. Configure PyCharm to recognize the bundle configuration JSON schema file, and then complete the JSON schema mapping, by following the instructions in Configure a custom JSON schema.
3. Use PyCharm to create or open a bundle configuration file. This file must be named databricks.yml. As you type, PyCharm checks for JSON schema syntax and formatting and provides code completion hints.
IntelliJ IDEA Ultimate
1. Generate the Databricks Asset Bundle configuration JSON schema file by using Databricks CLI version 0.218.0 or above to run the bundle schema command and redirect the output to a JSON file. For example, generate a file named bundle_config_schema.json within the current directory, as follows:
   databricks bundle schema > bundle_config_schema.json
2. Configure IntelliJ IDEA to recognize the bundle configuration JSON schema file, and then complete the JSON schema mapping, by following the instructions in Configure a custom JSON schema.
3. Use IntelliJ IDEA to create or open a bundle configuration file. This file must be named databricks.yml. As you type, IntelliJ IDEA checks for JSON schema syntax and formatting and provides code completion hints.
Step 2: Populate the bundle configuration files
Bundle configuration files define your Azure Databricks workflows by specifying settings such as workspace details, artifact names, file locations, job details, and pipeline details. Typically bundle configuration also contains development, staging, and production deployment targets. For detailed information about bundle configuration files, see Databricks Asset Bundle configuration.
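For example, the targets mapping in databricks.yml might declare development and production deployments as in the following sketch. The workspace host URLs are hypothetical placeholders:
targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://adb-1111111111111111.1.azuredatabricks.net
  prod:
    mode: production
    workspace:
      host: https://adb-2222222222222222.2.azuredatabricks.net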
You can use the bundle generate command to autogenerate bundle configuration for an existing resource in the workspace, then use bundle deployment bind to link the bundle configuration to the resource in the workspace to keep them in sync. See Generate a bundle configuration file and Bind bundle resources.
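For example, assuming a job already exists in the workspace with the hypothetical ID 6565621249, you might generate configuration for it and then bind the resulting resource key (here assumed to be hello_job) to it:
databricks bundle generate job --existing-job-id 6565621249
databricks bundle deployment bind hello_job 6565621249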
Step 3: Validate the bundle configuration files
Before you deploy artifacts or run a job or pipeline, you should verify that definitions in your bundle configuration files are valid. To do this, run the bundle validate command from the bundle project root directory. See Validate a bundle.
databricks bundle validate
If the validation is successful, a summary of the bundle identity and a confirmation message is returned. To output the schema, use the databricks bundle schema command. See Display the bundle configuration schema.
Step 4: Deploy the bundle
Before you deploy the bundle, make sure that the remote workspace has workspace files enabled. See What are workspace files?.
To deploy a bundle to a remote workspace, run the bundle deploy command from the bundle root as described in Deploy a bundle. The Databricks CLI deploys to the target workspace that is declared within the bundle configuration files. See targets.
databricks bundle deploy
A bundle’s unique identity is defined by its name, target, and the identity of the deployer. If these attributes are identical across different bundles, deployment of these bundles will interfere with one another. See Deploy a bundle for additional details.
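To deploy to a specific target rather than the default, pass the -t option (the same option shown for bundle run in Step 5), for example with a target named dev:
databricks bundle deploy -t dev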
Tip
You can run databricks bundle commands outside of the bundle root by setting the BUNDLE_ROOT environment variable. If this environment variable is not set, databricks bundle commands attempt to find the bundle root by searching within the current working directory.
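For example, to validate a bundle from another directory, set the variable for the command (the path is a hypothetical placeholder):
BUNDLE_ROOT=/path/to/my_bundle databricks bundle validate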
Step 5: Run the bundle
To run a specific job or pipeline, run the bundle run command from the bundle root, specifying the job or pipeline key declared within the bundle configuration files, as described in Run a job or pipeline. The resource key is the top-level element of the resource's YAML block. If you do not specify a job or pipeline key, you are prompted to select a resource to run from a list of available resources. If the -t option is not specified, the default target as declared within the bundle configuration files is used. For example, to run a job with the key hello_job within the context of the default target:
databricks bundle run hello_job
To run the job with the key hello_job within the context of a target declared with the name dev:
databricks bundle run -t dev hello_job
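For reference, the hello_job key in these commands corresponds to a top-level element under resources in the bundle configuration, as in this sketch (the job name, task key, and notebook path are hypothetical):
resources:
  jobs:
    hello_job:
      name: hello_job
      tasks:
        - task_key: main_task
          notebook_task:
            notebook_path: ./src/notebook.ipynb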
Step 6: Destroy the bundle
Warning
Destroying a bundle permanently deletes a bundle’s previously-deployed jobs, pipelines, and artifacts. This action cannot be undone.
If you are finished with your bundle and want to delete jobs, pipelines, and artifacts that were previously deployed, run the bundle destroy command from the bundle root. This command deletes all previously-deployed jobs, pipelines, and artifacts that are defined in the bundle configuration files. See Destroy a bundle.
databricks bundle destroy
By default, you are prompted to confirm permanent deletion of the previously-deployed jobs, pipelines, and artifacts. To skip these prompts and perform automatic permanent deletion, add the --auto-approve option to the bundle destroy command.
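For example:
databricks bundle destroy --auto-approve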