Develop Delta Live Tables pipelines with Databricks Asset Bundles

Databricks Asset Bundles, also known simply as bundles, enable you to programmatically validate, deploy, and run Azure Databricks resources such as Delta Live Tables pipelines. See What are Databricks Asset Bundles?.

This article describes how to create a bundle to programmatically manage a Delta Live Tables pipeline. See What is Delta Live Tables?. You create the bundle using the Databricks Asset Bundles default bundle template for Python, which consists of a notebook paired with the definition of a pipeline and a job to run it. You then validate, deploy, and run the deployed pipeline in your Azure Databricks workspace.

Tip

If you want to move existing pipelines that were created using the Azure Databricks user interface or API to bundles, you must define them in a bundle’s configuration files. Databricks recommends that you first create a bundle using the steps below and validate that it works. You can then add more definitions, notebooks, and other sources to the bundle. See Add an existing pipeline definition to a bundle.

Requirements

(Optional) Install a Python module to support local pipeline development

Databricks provides a Python module to assist your local development of Delta Live Tables pipeline code by providing syntax checking, autocomplete, and data type checking as you write code in your IDE.

The Python module for local development is available on PyPI. To install the module, see Python stub for Delta Live Tables.

Create a bundle using a project template

Create the bundle using the Azure Databricks default bundle template for Python. This template consists of a notebook that defines a Delta Live Tables pipeline, which filters data from the original dataset. For more information about bundle templates, see Databricks Asset Bundle project templates.

If you want to create a bundle from scratch, see Create a bundle manually.

Step 1: Set up authentication

In this step, you set up authentication between the Databricks CLI on your development machine and your Azure Databricks workspace. This article assumes that you want to use OAuth user-to-machine (U2M) authentication and a corresponding Azure Databricks configuration profile named DEFAULT for authentication.

Note

U2M authentication is appropriate for trying out these steps in real time. For fully automated workflows, Databricks recommends that you use OAuth machine-to-machine (M2M) authentication instead. See the M2M authentication setup instructions in Authentication.

  1. Use the Databricks CLI to initiate OAuth token management locally by running the following command for each target workspace.

    In the following command, replace <workspace-url> with your Azure Databricks per-workspace URL, for example https://adb-1234567890123456.7.azuredatabricks.net.

    databricks auth login --host <workspace-url>
    
  2. The Databricks CLI prompts you to save the information that you entered as an Azure Databricks configuration profile. Press Enter to accept the suggested profile name, or enter the name of a new or existing profile. Any existing profile with the same name is overwritten with the information that you entered. You can use profiles to quickly switch your authentication context across multiple workspaces.

    To get a list of any existing profiles, in a separate terminal or command prompt, use the Databricks CLI to run the command databricks auth profiles. To view a specific profile’s existing settings, run the command databricks auth env --profile <profile-name>.

  3. In your web browser, complete the on-screen instructions to log in to your Azure Databricks workspace.

  4. To view a profile’s current OAuth token value and the token’s upcoming expiration timestamp, run one of the following commands:

    • databricks auth token --host <workspace-url>
    • databricks auth token -p <profile-name>
    • databricks auth token --host <workspace-url> -p <profile-name>

    If you have multiple profiles with the same --host value, you might need to specify the --host and -p options together to help the Databricks CLI find the correct matching OAuth token information.
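
The three forms of the token command above differ only in which options are passed. As a minimal sketch, the following Python helper assembles the argument list from a workspace URL, a profile name, or both. The function name auth_token_command is hypothetical and is not part of the Databricks CLI; only the databricks auth token command and its --host and -p options come from the steps above.

```python
from typing import List, Optional


def auth_token_command(host: Optional[str] = None,
                       profile: Optional[str] = None) -> List[str]:
    """Build a `databricks auth token` argument list (hypothetical helper)."""
    if host is None and profile is None:
        raise ValueError("Provide a workspace URL, a profile name, or both.")
    command = ["databricks", "auth", "token"]
    if host is not None:
        # Identify the workspace by its per-workspace URL.
        command += ["--host", host]
    if profile is not None:
        # Identify the saved configuration profile.
        command += ["-p", profile]
    return command
```

Passing both arguments produces the combined form, which is the one to try when several profiles share the same --host value. The resulting list can be handed to subprocess.run on a machine where the Databricks CLI is installed.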

Step 2: Create the bundle

Initialize a bundle using the default Python bundle project template.

  1. Use your terminal or command prompt to switch to a directory on your local development machine that will contain the template’s generated bundle.

  2. Use the Databricks CLI to run the bundle init command:

    databricks bundle init
    
  3. For Template to use, leave the default value of default-python by pressing Enter.

  4. For Unique name for this project, leave the default value of my_project, or type a different value, and then press Enter. This determines the name of the root directory for this bundle. This root directory is created within your current working directory.

  5. For Include a stub (sample) notebook, select no and press Enter. This instructs the Databricks CLI to not add a sample notebook at this point, as the sample notebook that is associated with this option has no Delta Live Tables code in it.

  6. For Include a stub (sample) DLT pipeline, leave the default value of yes by pressing Enter. This instructs the Databricks CLI to add a sample notebook that has Delta Live Tables code in it.

  7. For Include a stub (sample) Python package, select no and press Enter. This instructs the Databricks CLI to not add sample Python wheel package files or related build instructions to your bundle.
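
If you prefer not to answer these prompts interactively, the answers can, under some assumptions, be supplied from a JSON file. The sketch below writes such a file with the same choices as the steps above. The key names (project_name, include_notebook, include_dlt, include_python) and the --config-file option of bundle init are assumptions about the default-python template and your CLI version; check them against databricks bundle init --help and the template's schema before relying on them. The function name write_init_config is hypothetical.

```python
import json
from pathlib import Path


def write_init_config(path: str, project_name: str = "my_project") -> dict:
    """Write assumed default-python template answers to a JSON file."""
    answers = {
        # Assumed key names; they may differ between CLI versions.
        "project_name": project_name,
        "include_notebook": "no",   # no sample notebook without DLT code
        "include_dlt": "yes",       # include the sample DLT pipeline
        "include_python": "no",     # no sample Python wheel package
    }
    Path(path).write_text(json.dumps(answers, indent=2) + "\n", encoding="utf-8")
    return answers
```

With the file in place, a command along the lines of databricks bundle init default-python --config-file init_config.json would be expected to create the bundle without prompting, provided your CLI version supports that option.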

Step 3: Explore the bundle

To view the files that the template generated, switch to the root directory of your newly created bundle. Files of particular interest include the following:

  • databricks.yml: This file specifies the bundle’s programmatic name, includes a reference to the pipeline definition, and specifies settings about the target workspace.
  • resources/<project-name>_job.yml and resources/<project-name>_pipeline.yml: These files define the job that contains a pipeline refresh task, and the pipeline’s settings.
  • src/dlt_pipeline.ipynb: This file is a notebook that, when run, executes the pipeline.

To customize a pipeline, edit the mappings within its pipeline declaration. These mappings correspond to the request payload of the create pipeline operation, as defined in POST /api/2.0/pipelines in the REST API reference, expressed in YAML format.

Step 4: Validate the project’s bundle configuration file

In this step, you check whether the bundle configuration is valid.

  1. From the root directory, use the Databricks CLI to run the bundle validate command, as follows:

    databricks bundle validate
    
  2. If a summary of the bundle configuration is returned, then the validation succeeded. If any errors are returned, fix the errors, and then repeat this step.

If you make any changes to your bundle after this step, you should repeat this step to check whether your bundle configuration is still valid.

Step 5: Deploy the local project to the remote workspace

In this step, you deploy the local notebook to your remote Azure Databricks workspace and create the Delta Live Tables pipeline within your workspace.

  1. From the bundle root, use the Databricks CLI to run the bundle deploy command as follows:

    databricks bundle deploy -t dev
    
  2. Check whether the local notebook was deployed: In your Azure Databricks workspace’s sidebar, click Workspace.

  3. Click into the Users > <your-username> > .bundle > <project-name> > dev > files > src folder. The notebook should be in this folder.

  4. Check whether the pipeline was created: In your Azure Databricks workspace’s sidebar, click Delta Live Tables.

  5. On the Delta Live Tables tab, click [dev <your-username>] <project-name>_pipeline.

If you make any changes to your bundle after this step, you should repeat steps 4-5 to check whether your bundle configuration is still valid and then redeploy the project.

Step 6: Run the deployed project

In this step, you trigger a run of the Delta Live Tables pipeline in your workspace from the command line.

  1. From the root directory, use the Databricks CLI to run the bundle run command, as follows, replacing <project-name> with the name of your project from Step 2:

    databricks bundle run -t dev <project-name>_pipeline
    
  2. Copy the value of Update URL that appears in your terminal and paste this value into your web browser to open your Azure Databricks workspace.

  3. In your Azure Databricks workspace, after the pipeline completes successfully, click the taxi_raw view and the filtered_taxis materialized view to see the details.

If you make any changes to your bundle after this step, you should repeat steps 4-6 to check whether your bundle configuration is still valid, redeploy the project, and run the redeployed project.
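
Because steps 4-6 are repeated after every change, it can help to script them. The following is a minimal sketch, not an official workflow: it builds the three commands shown above and runs them in order, stopping at the first one that fails. The function names bundle_commands and run_steps are hypothetical, and the sketch assumes the pipeline resource is named <project-name>_pipeline as generated by the template.

```python
import subprocess
from typing import Callable, List, Sequence


def bundle_commands(project_name: str, target: str = "dev") -> List[List[str]]:
    """Return the validate, deploy, and run commands from steps 4-6."""
    return [
        ["databricks", "bundle", "validate"],
        ["databricks", "bundle", "deploy", "-t", target],
        ["databricks", "bundle", "run", "-t", target, f"{project_name}_pipeline"],
    ]


def run_steps(commands: Sequence[Sequence[str]],
              runner: Callable[[Sequence[str]], int] = subprocess.call) -> int:
    """Run each command in order; return the first non-zero exit code, else 0."""
    for command in commands:
        exit_code = runner(command)
        if exit_code != 0:
            # Do not deploy or run a bundle whose earlier step failed.
            return exit_code
    return 0
```

From the bundle's root directory, run_steps(bundle_commands("my_project")) would execute all three steps with the default runner. The runner parameter exists so that the sequence can be exercised without the Databricks CLI installed.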

Step 7: Clean up

In this step, you delete the deployed notebook and the pipeline from your workspace.

  1. From the root directory, use the Databricks CLI to run the bundle destroy command, as follows:

    databricks bundle destroy -t dev
    
  2. Confirm the pipeline deletion request: When prompted to permanently destroy resources, type y and press Enter.

  3. Confirm the notebook deletion request: When prompted to permanently destroy the previously deployed folder and all of its files, type y and press Enter.

  4. If you also want to delete the bundle from your development machine, you can now delete the local directory from Step 2.

Add an existing pipeline definition to a bundle

You can use an existing Delta Live Tables pipeline definition as a basis to define a new pipeline in a bundle configuration file. To get an existing pipeline definition, you can manually retrieve it using the UI, or you can generate it programmatically using the Databricks CLI.

Get an existing pipeline definition using the UI

To get the YAML representation of an existing pipeline definition from the Azure Databricks workspace UI:

  1. In your Azure Databricks workspace’s sidebar, click Workflows.

  2. On the Delta Live Tables tab, click your pipeline’s Name link.

  3. Next to the Development button, click the kebab menu, and then click View settings YAML.

  4. Copy the pipeline definition’s YAML in the Pipeline settings YAML dialog to your local clipboard by clicking the copy icon.

  5. Add the YAML that you copied to your bundle’s databricks.yml file, or create a configuration file for your pipeline in the resources folder of your bundle project and reference it from your databricks.yml file. See resources.

  6. Download any Python files and notebooks that the pipeline references, and add them to the bundle’s project source. Typically, bundle artifacts are stored in the bundle’s src directory.

    Tip

    You can export an existing notebook from an Azure Databricks workspace into the .ipynb format by clicking File > Export > IPython Notebook from the Azure Databricks notebook user interface.

    After you add your notebooks, Python files, and other artifacts to the bundle, make sure that your pipeline definition properly references them. For example, for a notebook named hello.ipynb that is in the src/ directory of the bundle:

    resources:
      pipelines:
        hello-pipeline:
          name: hello-pipeline
          libraries:
            - notebook:
                path: ../src/hello.ipynb
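
The leading ../ in the path above makes sense if the pipeline definition is stored in a file under the resources folder and relative paths are resolved against the file that declares them. The sketch below illustrates that resolution under those assumptions. The file name resources/hello_pipeline.yml and the function name resolve_from_bundle_root are hypothetical.

```python
import posixpath


def resolve_from_bundle_root(config_file: str, referenced_path: str) -> str:
    """Resolve a path declared in config_file to a path relative to the bundle root."""
    # Relative paths are assumed to be anchored at the declaring file's folder.
    config_dir = posixpath.dirname(config_file)
    return posixpath.normpath(posixpath.join(config_dir, referenced_path))


# A definition stored under resources/ must step up one level to reach src/.
print(resolve_from_bundle_root("resources/hello_pipeline.yml", "../src/hello.ipynb"))
# A definition placed directly in databricks.yml would not need the ../ prefix.
print(resolve_from_bundle_root("databricks.yml", "./src/hello.ipynb"))
```

Both calls resolve to src/hello.ipynb, so if you move a pipeline definition between databricks.yml and the resources folder, adjust the relative path accordingly.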
    

Generate an existing pipeline definition using the Databricks CLI

To programmatically generate bundle configuration for an existing pipeline:

  1. Retrieve the ID of the existing pipeline from the Pipeline details side panel for the pipeline in the UI, or use the Databricks CLI databricks pipelines list-pipelines command.

  2. Run the bundle generate pipeline Databricks CLI command, setting the pipeline ID:

    databricks bundle generate pipeline --existing-pipeline-id 6565621249
    

    This command creates a bundle configuration file for the pipeline in the bundle’s resources folder and downloads any referenced artifacts to the src folder.

    Tip

    If you first use bundle deployment bind to bind a resource in a bundle to one in the workspace, the next bundle deploy updates the workspace resource based on the configuration defined in the bundle that it is bound to. For information on bundle deployment bind, see Bind bundle resources.