Bundle configuration examples

12/20/2024

This article provides example configurations for Databricks Asset Bundles features and common bundle use cases.

Tip

Some of these examples, along with others, are available in the bundle-examples GitHub repository.

Job that uses serverless compute

Databricks Asset Bundles support jobs that run on serverless compute. To configure this, either omit the cluster settings for a job with a notebook task, or specify an environment, as the following examples show. Python script, Python wheel, and dbt tasks require environment_key to run on serverless compute. See environment_key.

# A serverless job (no cluster definition)
resources:
  jobs:
    serverless_job_no_cluster:
      name: serverless_job_no_cluster

      email_notifications:
        on_failure:
          - someone@example.com

      tasks:
        - task_key: notebook_task
          notebook_task:
            notebook_path: ../src/notebook.ipynb

# A serverless job (environment spec)
resources:
  jobs:
    serverless_job_environment:
      name: serverless_job_environment

      tasks:
        - task_key: task
          spark_python_task:
            python_file: ../src/main.py

          # The key that references an environment spec in a job.
          # https://docs.databricks.com/api/workspace/jobs/create#tasks-environment_key
          environment_key: default

      # A list of task execution environment specifications that can be referenced by tasks of this job.
      environments:
        - environment_key: default

          # Full documentation of this spec can be found at:
          # https://docs.databricks.com/api/workspace/jobs/create#environments-spec
          spec:
            client: "1"
            dependencies:
              - my-library
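
Because Python wheel tasks also require environment_key on serverless compute, the same pattern can be sketched for a wheel task. This is a minimal sketch rather than a definitive configuration: the job key, the package name my_package, the entry point main, and the ../dist/*.whl path are hypothetical, and it assumes the built wheel is listed as an environment dependency.

```yaml
# A serverless job with a Python wheel task (sketch; names are hypothetical)
resources:
  jobs:
    serverless_wheel_job:
      name: serverless_wheel_job

      tasks:
        - task_key: wheel_task
          python_wheel_task:
            package_name: my_package # hypothetical package name
            entry_point: main # hypothetical entry point exposed by the package

          # Required for Python wheel tasks on serverless compute.
          environment_key: default

      environments:
        - environment_key: default
          spec:
            client: "1"
            dependencies:
              # Assumed location of the built wheel, relative to this file.
              - ../dist/*.whl
```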

Pipeline that uses serverless compute

Databricks Asset Bundles support pipelines that run on serverless compute. To configure this, set the pipeline serverless setting to true. The following example configuration defines a pipeline that runs on serverless compute and a job that triggers a refresh of the pipeline every hour.

# A pipeline that runs on serverless compute
resources:
  pipelines:
    my_pipeline:
      name: my_pipeline
      target: ${bundle.target}
      serverless: true
      catalog: users
      libraries:
        - notebook:
            path: ../src/my_pipeline.ipynb

      configuration:
        bundle.sourcePath: /Workspace/${workspace.file_path}/src

# A job, triggered every hour, that refreshes the pipeline
resources:
  jobs:
    my_job:
      name: my_job

      # Run this job once an hour.
      trigger:
        periodic:
          interval: 1
          unit: HOURS

      email_notifications:
        on_failure:
          - someone@example.com

      tasks:
        - task_key: refresh_pipeline
          pipeline_task:
            pipeline_id: ${resources.pipelines.my_pipeline.id}
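
If a run should recompute all pipeline data instead of performing an incremental update, the pipeline task in the Jobs API accepts a full_refresh flag. The following is a minimal sketch that assumes the my_pipeline resource from the example above; the job key my_full_refresh_job and the task key are hypothetical.

```yaml
# A job that triggers a full refresh of the pipeline (sketch)
resources:
  jobs:
    my_full_refresh_job: # hypothetical job key
      name: my_full_refresh_job

      tasks:
        - task_key: full_refresh_pipeline
          pipeline_task:
            pipeline_id: ${resources.pipelines.my_pipeline.id}
            # Reprocess all data rather than only new data.
            full_refresh: true
```

A full refresh can be expensive on large pipelines, so a job like this is usually run on demand rather than on a periodic trigger.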

Job with a SQL notebook

The following example configuration defines a job that runs a SQL notebook on a SQL warehouse.

resources:
  jobs:
    job_with_sql_notebook:
      name: "Job to demonstrate using a SQL notebook with a SQL warehouse"
      tasks:
        - task_key: notebook
          notebook_task:
            notebook_path: ./select.sql
            warehouse_id: 799f096837fzzzz4
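
Hardcoding a warehouse ID ties the job to one workspace. One possible alternative, sketched below, is a bundle variable that looks up the ID by warehouse name. The variable name warehouse_id and the warehouse name "Shared SQL Warehouse" are hypothetical, and the variables mapping is assumed to live at the top level of the bundle configuration.

```yaml
# Resolve the warehouse ID by name instead of hardcoding it (sketch)
variables:
  warehouse_id:
    description: ID of the SQL warehouse that runs the notebook
    lookup:
      warehouse: "Shared SQL Warehouse" # hypothetical warehouse name

resources:
  jobs:
    job_with_sql_notebook:
      name: "Job to demonstrate using a SQL notebook with a SQL warehouse"
      tasks:
        - task_key: notebook
          notebook_task:
            notebook_path: ./select.sql
            warehouse_id: ${var.warehouse_id}
```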

Job with multiple wheel files

The following example configuration defines a bundle that contains a job that installs multiple wheel (*.whl) files as libraries.

# job.yml
resources:
  jobs:
    example_job:
      name: "Example with multiple wheels"
      tasks:
        - task_key: task

          spark_python_task:
            python_file: ../src/call_wheel.py

          libraries:
            - whl: ../my_custom_wheel1/dist/*.whl
            - whl: ../my_custom_wheel2/dist/*.whl

          new_cluster:
            node_type_id: i3.xlarge
            num_workers: 0
            spark_version: 14.3.x-scala2.12
            spark_conf:
              "spark.databricks.cluster.profile": "singleNode"
              "spark.master": "local[*, 4]"
            custom_tags:
              "ResourceClass": "SingleNode"

# databricks.yml
bundle:
  name: job_with_multiple_wheels

include:
  - ./resources/job.yml

workspace:
  host: https://myworkspace.cloud.databricks.com

artifacts:
  my_custom_wheel1:
    type: whl
    build: poetry build
    path: ./my_custom_wheel1

  my_custom_wheel2:
    type: whl
    build: poetry build
    path: ./my_custom_wheel2

targets:
  dev:
    default: true
    mode: development

Job that uses a requirements.txt file

The following example configuration defines a job that uses a requirements.txt file.

resources:
  jobs:
    job_with_requirements_txt:
      name: "Example job that uses a requirements.txt file"
      tasks:
        - task_key: task
          # References a job cluster defined under job_clusters for this job.
          job_cluster_key: default
          spark_python_task:
            python_file: ../src/main.py
          libraries:
            - requirements: /Workspace/${workspace.file_path}/requirements.txt
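
The task above refers to a job cluster by the key default, so the job also needs a matching job_clusters entry. A minimal sketch of that entry follows; the runtime version, node type, and worker count are illustrative assumptions, not values from the original example.

```yaml
# Job cluster referenced by job_cluster_key: default (sketch; values are assumptions)
resources:
  jobs:
    job_with_requirements_txt:
      job_clusters:
        - job_cluster_key: default
          new_cluster:
            spark_version: 15.4.x-scala2.12 # illustrative runtime version
            node_type_id: i3.xlarge # illustrative node type
            num_workers: 1
```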

Job on a schedule

The following examples show configuration for jobs that run on a schedule. For information about job schedules and triggers, see Automating jobs with schedules and triggers.

This configuration defines a job that runs daily at a specified time:

resources:
  jobs:
    my-notebook-job:
      name: my-notebook-job
      tasks:
        - task_key: my-notebook-task
          notebook_task:
            notebook_path: ./my-notebook.ipynb
      schedule:
        quartz_cron_expression: "0 0 8 * * ?" # daily at 8am
        timezone_id: UTC
        pause_status: UNPAUSED
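
Other calendar-based schedules differ only in the Quartz expression, whose fields are seconds, minutes, hours, day of month, month, and day of week. As a sketch, the following variant runs at 6:30 on weekdays; the job key and the time zone are illustrative choices.

```yaml
resources:
  jobs:
    my-weekday-job: # hypothetical job key
      name: my-weekday-job
      tasks:
        - task_key: my-notebook-task
          notebook_task:
            notebook_path: ./my-notebook.ipynb
      schedule:
        # sec min hour day-of-month month day-of-week
        quartz_cron_expression: "0 30 6 ? * MON-FRI" # weekdays at 6:30
        timezone_id: America/Los_Angeles # illustrative time zone
        pause_status: UNPAUSED
```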

In this configuration, the job runs periodically, one week after its previous run:

resources:
  jobs:
    my-notebook-job:
      name: my-notebook-job
      tasks:
        - task_key: my-notebook-task
          notebook_task:
            notebook_path: ./my-notebook.ipynb
      trigger:
        pause_status: UNPAUSED
        periodic:
          interval: 1
          unit: WEEKS

Bundle that uploads a JAR file to Unity Catalog

You can specify Unity Catalog volumes as an artifact path so that all artifacts, such as JAR files and wheel files, are uploaded to Unity Catalog volumes. The following example bundle uploads a JAR file to Unity Catalog. For information on the artifact_path mapping, see artifact_path.

bundle:
  name: jar-bundle

workspace:
  host: https://myworkspace.cloud.databricks.com
  artifact_path: /Volumes/main/default/my_volume

artifacts:
  my_java_code:
    path: ./sample-java
    build: "javac PrintArgs.java && jar cvfm PrintArgs.jar META-INF/MANIFEST.MF PrintArgs.class"
    files:
      - source: ./sample-java/PrintArgs.jar

resources:
  jobs:
    jar_job:
      name: "Spark Jar Job"
      tasks:
        - task_key: SparkJarTask
          new_cluster:
            num_workers: 1
            spark_version: "14.3.x-scala2.12"
            node_type_id: "i3.xlarge"
          spark_jar_task:
            main_class_name: PrintArgs
          libraries:
            - jar: ./sample-java/PrintArgs.jar
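
Judging by its name, PrintArgs presumably prints the arguments it receives. Arguments can be passed to the main class through the parameters list of spark_jar_task. The sketch below shows only the task mapping, and the argument values are hypothetical.

```yaml
# Passing arguments to the JAR's main class (sketch; argument values are hypothetical)
resources:
  jobs:
    jar_job:
      name: "Spark Jar Job"
      tasks:
        - task_key: SparkJarTask
          # new_cluster and libraries as in the example above
          spark_jar_task:
            main_class_name: PrintArgs
            parameters:
              - "first-argument"
              - "second-argument"
```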
