Add tasks to jobs in Databricks Asset Bundles

Article
12/05/2024

This article provides examples of various types of tasks that you can add to Azure Databricks jobs in Databricks Asset Bundles. See What are Databricks Asset Bundles?.

Most job task types have task-specific parameters among their supported settings, but you can also define job parameters that get passed to tasks. Job parameters support dynamic value references, which let you pass values specific to the job run between tasks. See What is a dynamic value reference?.

Note

You can override job task settings. See Override job tasks settings in Databricks Asset Bundles.
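As a rough sketch of what such an override can look like, the following fragment assumes a hypothetical target named dev and reuses the my-notebook-job job and my-notebook-task task key from the notebook task example in this article. Task settings declared under a target are matched to the top-level task by task_key. The max_retries value is only illustrative.

```yaml
targets:
  dev: # Assumption: a hypothetical target name.
    resources:
      jobs:
        my-notebook-job:
          tasks:
            # Matched to the top-level task with the same task_key.
            - task_key: my-notebook-task
              # Illustrative override: do not retry the task in this target.
              max_retries: 0
```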

Tip

To quickly generate resource configuration for an existing job using the Databricks CLI, you can use the bundle generate job command. See bundle commands.

Notebook task

You use this task to run a notebook.

The following example adds a notebook task to a job and sets a job parameter named my_job_run_id. The path for the notebook to deploy is relative to the configuration file in which this task is declared. The task gets the notebook from its deployed location in the Azure Databricks workspace.

resources:
  jobs:
    my-notebook-job:
      name: my-notebook-job
      tasks:
        - task_key: my-notebook-task
          notebook_task:
            notebook_path: ./my-notebook.ipynb
      parameters:
        - name: my_job_run_id
          default: "{{job.run_id}}"

For additional mappings that you can set for this task, see tasks > notebook_task in the create job operation’s request payload as defined in POST /api/2.1/jobs/create in the REST API reference, expressed in YAML format. See Notebook task for jobs.
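As one illustration of those additional mappings, the following sketch sets task-level parameters with base_parameters instead of a job parameter. The parameter names job_id and task_name are hypothetical. The values are dynamic value references.

```yaml
resources:
  jobs:
    my-notebook-job:
      name: my-notebook-job
      tasks:
        - task_key: my-notebook-task
          notebook_task:
            notebook_path: ./my-notebook.ipynb
            # Task-level parameters, passed only to this notebook task.
            base_parameters:
              # Hypothetical parameter names, chosen for illustration.
              job_id: "{{job.id}}"
              task_name: "{{task.name}}"
```

Inside the notebook, such values are typically read with dbutils.widgets.get.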

If/else condition task

The condition_task enables you to add a task with if/else conditional logic to your job. The task evaluates a condition that can be used to control the execution of other tasks. The condition task does not require a cluster to execute and does not support retries or notifications. For more information about the if/else task, see Add branching logic to a job with the If/else task.

The following example contains a condition task and a notebook task, where the notebook task only executes if the number of job repairs is less than 5.

resources:
  jobs:
    my-job:
      name: my-job
      tasks:
        - task_key: condition_task
          condition_task:
            op: LESS_THAN
            left: "{{job.repair_count}}"
            right: "5"
        - task_key: notebook_task
          depends_on:
            - task_key: condition_task
              outcome: "true"
          notebook_task:
            notebook_path: ../src/notebook.ipynb

For additional mappings that you can set for this task, see tasks > condition_task in the create job operation’s request payload as defined in POST /api/2.1/jobs/create in the REST API reference, expressed in YAML format.
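The same pattern can cover the else branch. The following sketch extends the example above with a hypothetical fallback_task (and a hypothetical notebook path) that runs only when the condition evaluates to false.

```yaml
resources:
  jobs:
    my-job:
      name: my-job
      tasks:
        - task_key: condition_task
          condition_task:
            op: LESS_THAN
            left: "{{job.repair_count}}"
            right: "5"
        # Runs when the condition is true.
        - task_key: notebook_task
          depends_on:
            - task_key: condition_task
              outcome: "true"
          notebook_task:
            notebook_path: ../src/notebook.ipynb
        # Assumption: a hypothetical task that runs when the condition is false.
        - task_key: fallback_task
          depends_on:
            - task_key: condition_task
              outcome: "false"
          notebook_task:
            notebook_path: ../src/fallback.ipynb
```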

For each task

The for_each_task enables you to add a task with a for each loop to your job. The task executes a nested task for every input provided. For more information about the for_each_task, see Run a parameterized Azure Databricks job task in a loop.

The following example adds a for_each_task to a job. The task loops over the values produced by another task and processes each one.

resources:
  jobs:
    my_job:
      name: my_job
      tasks:
        - task_key: generate_countries_list
          notebook_task:
            notebook_path: ../src/generate_countries_list.ipynb
        - task_key: process_countries
          depends_on:
            - task_key: generate_countries_list
          for_each_task:
            inputs: "{{tasks.generate_countries_list.values.countries}}"
            task:
              task_key: process_countries_iteration
              notebook_task:
                notebook_path: ../src/process_countries_notebook.ipynb

For additional mappings that you can set for this task, see tasks > for_each_task in the create job operation’s request payload as defined in POST /api/2.1/jobs/create in the REST API reference, expressed in YAML format.
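The inputs can also be a static JSON array, and each iteration can read its current value through the {{input}} reference. The following sketch assumes illustrative country codes, a hypothetical parameter named country, and a concurrency of 2.

```yaml
resources:
  jobs:
    my_job:
      name: my_job
      tasks:
        - task_key: process_countries
          for_each_task:
            # A JSON array passed as a string (illustrative values).
            inputs: '["us", "ca", "mx"]'
            # Run up to two iterations at the same time.
            concurrency: 2
            task:
              task_key: process_countries_iteration
              notebook_task:
                notebook_path: ../src/process_countries_notebook.ipynb
                base_parameters:
                  # {{input}} resolves to the value for the current iteration.
                  # Assumption: "country" is a hypothetical parameter name.
                  country: "{{input}}"
```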

Python script task

You use this task to run a Python file.

The following example adds a Python script task to a job. The path for the Python file to deploy is relative to the configuration file in which this task is declared. The task gets the Python file from its deployed location in the Azure Databricks workspace.

resources:
  jobs:
    my-python-script-job:
      name: my-python-script-job
      tasks:
        - task_key: my-python-script-task
          spark_python_task:
            python_file: ./my-script.py

For additional mappings that you can set for this task, see tasks > spark_python_task in the create job operation’s request payload as defined in POST /api/2.1/jobs/create in the REST API reference, expressed in YAML format. See also Python script task for jobs.
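For instance, the parameters mapping passes command-line arguments to the script. The following sketch assumes a hypothetical --run-id flag that my-script.py would have to parse itself, for example with argparse.

```yaml
resources:
  jobs:
    my-python-script-job:
      name: my-python-script-job
      tasks:
        - task_key: my-python-script-task
          spark_python_task:
            python_file: ./my-script.py
            # Passed to the script as command-line arguments, in order.
            parameters:
              - "--run-id" # Assumption: a hypothetical flag name.
              - "{{job.run_id}}"
```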

Python wheel task

You use this task to run a Python wheel file.

The following example adds a Python wheel task to a job. The path for the Python wheel file to deploy is relative to the configuration file in which this task is declared. See Databricks Asset Bundles library dependencies.

resources:
  jobs:
    my-python-wheel-job:
      name: my-python-wheel-job
      tasks:
        - task_key: my-python-wheel-task
          python_wheel_task:
            entry_point: run
            package_name: my_package
          libraries:
            - whl: ./my_package/dist/my_package-*.whl

For additional mappings that you can set for this task, see tasks > python_wheel_task in the create job operation’s request payload as defined in POST /api/2.1/jobs/create in the REST API reference, expressed in YAML format. See also Develop a Python wheel file using Databricks Asset Bundles and Python Wheel task for jobs.
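As one example of those mappings, named_parameters supplies keyword-style arguments to the wheel's entry point. The following sketch assumes a hypothetical parameter named environment that the run entry point would need to handle.

```yaml
resources:
  jobs:
    my-python-wheel-job:
      name: my-python-wheel-job
      tasks:
        - task_key: my-python-wheel-task
          python_wheel_task:
            entry_point: run
            package_name: my_package
            # Assumption: "environment" is a hypothetical parameter name.
            named_parameters:
              environment: dev
          libraries:
            - whl: ./my_package/dist/my_package-*.whl
```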

JAR task

You use this task to run a JAR. You can reference local JAR libraries or those in a workspace, a Unity Catalog volume, or an external cloud storage location. See Databricks Asset Bundles library dependencies.

The following example adds a JAR task to a job. The JAR path points to a Unity Catalog volume location.

resources:
  jobs:
    my-jar-job:
      name: my-jar-job
      tasks:
        - task_key: my-jar-task
          spark_jar_task:
            main_class_name: org.example.com.Main
          libraries:
            - jar: /Volumes/main/default/my-volume/my-project-0.1.0-SNAPSHOT.jar

For additional mappings that you can set for this task, see tasks > spark_jar_task in the create job operation’s request payload as defined in POST /api/2.1/jobs/create in the REST API reference, expressed in YAML format. See JAR task for jobs.
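For example, the parameters mapping passes arguments to the main method of the main class. The following sketch assumes a hypothetical --input flag and input path that org.example.com.Main would have to interpret.

```yaml
resources:
  jobs:
    my-jar-job:
      name: my-jar-job
      tasks:
        - task_key: my-jar-task
          spark_jar_task:
            main_class_name: org.example.com.Main
            # Passed to the main method as string arguments, in order.
            parameters:
              - "--input" # Assumption: a hypothetical flag name.
              - /Volumes/main/default/my-volume/input # Hypothetical path.
          libraries:
            - jar: /Volumes/main/default/my-volume/my-project-0.1.0-SNAPSHOT.jar
```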

SQL file task

You use this task to run a SQL file located in a workspace or a remote Git repository.

The following example adds a SQL file task to a job. This SQL file task uses the specified SQL warehouse to run the specified SQL file.

resources:
  jobs:
    my-sql-file-job:
      name: my-sql-file-job
      tasks:
        - task_key: my-sql-file-task
          sql_task:
            file:
              path: /Users/someone@example.com/hello-world.sql
              source: WORKSPACE
            warehouse_id: 1a111111a1111aa1

To get a SQL warehouse’s ID, open the SQL warehouse’s settings page, then copy the ID found in parentheses after the name of the warehouse in the Name field on the Overview tab.

For additional mappings that you can set for this task, see tasks > sql_task > file in the create job operation’s request payload as defined in POST /api/2.1/jobs/create in the REST API reference, expressed in YAML format. See SQL task for jobs.
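The sql_task mapping also accepts a parameters mapping of key-value pairs for the run. The following sketch assumes a hypothetical parameter named run_id. How the SQL file references that parameter is not shown here.

```yaml
resources:
  jobs:
    my-sql-file-job:
      name: my-sql-file-job
      tasks:
        - task_key: my-sql-file-task
          sql_task:
            file:
              path: /Users/someone@example.com/hello-world.sql
              source: WORKSPACE
            # Assumption: "run_id" is a hypothetical parameter name.
            parameters:
              run_id: "{{job.run_id}}"
            warehouse_id: 1a111111a1111aa1
```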

Delta Live Tables pipeline task

You use this task to run a Delta Live Tables pipeline. See What is Delta Live Tables?.

The following example adds a Delta Live Tables pipeline task to a job. This Delta Live Tables pipeline task runs the specified pipeline.

resources:
  jobs:
    my-pipeline-job:
      name: my-pipeline-job
      tasks:
        - task_key: my-pipeline-task
          pipeline_task:
            pipeline_id: 11111111-1111-1111-1111-111111111111

You can get a pipeline’s ID by opening the pipeline in the workspace and copying the Pipeline ID value on the Pipeline details tab of the pipeline’s settings page.

For additional mappings that you can set for this task, see tasks > pipeline_task in the create job operation’s request payload as defined in POST /api/2.1/jobs/create in the REST API reference, expressed in YAML format. See Delta Live Tables pipeline task for jobs.
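When the pipeline is defined in the same bundle, a substitution can supply its ID instead of a hard-coded value. The following sketch assumes a hypothetical pipeline resource named my-pipeline with a hypothetical notebook path, and also sets the optional full_refresh mapping.

```yaml
resources:
  pipelines:
    # Assumption: a hypothetical pipeline defined in the same bundle.
    my-pipeline:
      name: my-pipeline
      libraries:
        - notebook:
            path: ./my-pipeline-notebook.ipynb
  jobs:
    my-pipeline-job:
      name: my-pipeline-job
      tasks:
        - task_key: my-pipeline-task
          pipeline_task:
            # Resolved to the deployed pipeline's ID at deployment time.
            pipeline_id: ${resources.pipelines.my-pipeline.id}
            # Set to true to trigger a full refresh of the pipeline.
            full_refresh: false
```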

dbt task

You use this task to run one or more dbt commands. See Connect to dbt Cloud.

The following example adds a dbt task to a job. This dbt task uses the specified SQL warehouse to run the specified dbt commands.

resources:
  jobs:
    my-dbt-job:
      name: my-dbt-job
      tasks:
        - task_key: my-dbt-task
          dbt_task:
            commands:
              - "dbt deps"
              - "dbt seed"
              - "dbt run"
            project_directory: /Users/someone@example.com/Testing
            warehouse_id: 1a111111a1111aa1
          libraries:
            - pypi:
                package: "dbt-databricks>=1.0.0,<2.0.0"

To get a SQL warehouse’s ID, open the SQL warehouse’s settings page, then copy the ID found in parentheses after the name of the warehouse in the Name field on the Overview tab.

For additional mappings that you can set for this task, see tasks > dbt_task in the create job operation’s request payload as defined in POST /api/2.1/jobs/create in the REST API reference, expressed in YAML format. See dbt task for jobs.
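Among those mappings, catalog and schema set where the dbt task writes its output. The following sketch assumes illustrative values main and default. The version of dbt-databricks in use may affect whether catalog is supported.

```yaml
resources:
  jobs:
    my-dbt-job:
      name: my-dbt-job
      tasks:
        - task_key: my-dbt-task
          dbt_task:
            commands:
              - "dbt deps"
              - "dbt run"
            project_directory: /Users/someone@example.com/Testing
            warehouse_id: 1a111111a1111aa1
            # Assumption: illustrative catalog and schema names.
            catalog: main
            schema: default
          libraries:
            - pypi:
                package: "dbt-databricks>=1.0.0,<2.0.0"
```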

Databricks Asset Bundles also includes a dbt-sql project template that defines a job with a dbt task, as well as dbt profiles for deployed dbt jobs. For information about Databricks Asset Bundles templates, see Use a default bundle template.

Run job task

You use this task to run another job.

The following example contains a run job task in the second job that runs the first job.

resources:
  jobs:
    my-first-job:
      name: my-first-job
      tasks:
        - task_key: my-first-job-task
          new_cluster:
            spark_version: "13.3.x-scala2.12"
            node_type_id: "i3.xlarge"
            num_workers: 2
          notebook_task:
            notebook_path: ./src/test.py
    my_second_job:
      name: my-second-job
      tasks:
        - task_key: my-second-job-task
          run_job_task:
            job_id: ${resources.jobs.my-first-job.id}

This example uses a substitution to retrieve the ID of the job to run. To get a job’s ID from the UI, open the job in the workspace and copy the ID from the Job ID value in the Job details tab of the job’s settings page.

For additional mappings that you can set for this task, see tasks > run_job_task in the create job operation’s request payload as defined in POST /api/2.1/jobs/create in the REST API reference, expressed in YAML format.
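One of those mappings, job_parameters, passes job parameters to the job being run. The following sketch assumes a hypothetical job parameter named triggered_by that the first job declares with a default value. The cluster settings from the example above are omitted for brevity.

```yaml
resources:
  jobs:
    my-first-job:
      name: my-first-job
      tasks:
        - task_key: my-first-job-task
          notebook_task:
            notebook_path: ./src/test.py
      parameters:
        # Assumption: a hypothetical job parameter with an illustrative default.
        - name: triggered_by
          default: manual
    my_second_job:
      name: my-second-job
      tasks:
        - task_key: my-second-job-task
          run_job_task:
            job_id: ${resources.jobs.my-first-job.id}
            # Overrides the default value when the second job runs the first job.
            job_parameters:
              triggered_by: "{{job.name}}"
```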
