Updating from Jobs API 2.0 to 2.1

Article
12/16/2024

You can now orchestrate multiple tasks with Azure Databricks jobs. This article details changes to the Jobs API that support jobs with multiple tasks and has guidance to help you update your existing API clients to work with this new feature.

Databricks recommends Jobs API 2.1 for your API scripts and clients, particularly when using jobs with multiple tasks.

This article refers to jobs defined with a single task as single-task format and jobs defined with multiple tasks as multi-task format.

Jobs API 2.0 and 2.1 now support the update request. To change an existing job, use the update request rather than the reset request; this minimizes changes between single-task format jobs and multi-task format jobs.

API changes

The Jobs API now defines a TaskSettings object to capture settings for each task in a job. For multi-task format jobs, the tasks field, an array of TaskSettings data structures, is included in the JobSettings object. Some fields previously part of JobSettings are now part of the task settings for multi-task format jobs. JobSettings is also updated to include the format field. The format field indicates the format of the job and is a STRING value set to SINGLE_TASK or MULTI_TASK.

You need to update your existing API clients for these changes to JobSettings for multi-task format jobs. See the API client guide for more information on required changes.

Jobs API 2.1 supports the multi-task format. All API 2.1 requests must conform to this format, and responses are structured in this format.

Jobs API 2.0 is updated with an additional field to support multi-task format jobs. Except where noted, the examples in this document use API 2.0. However, Databricks recommends API 2.1 for new and existing API scripts and clients.

An example JSON document representing a multi-task format job for API 2.0 and 2.1:

{
  "job_id": 53,
  "settings": {
    "name": "A job with multiple tasks",
    "email_notifications": {},
    "timeout_seconds": 0,
    "max_concurrent_runs": 1,
    "tasks": [
      {
        "task_key": "clean_data",
        "description": "Clean and prepare the data",
        "notebook_task": {
          "notebook_path": "/Users/user@databricks.com/clean-data"
        },
        "existing_cluster_id": "1201-my-cluster",
        "max_retries": 3,
        "min_retry_interval_millis": 0,
        "retry_on_timeout": true,
        "timeout_seconds": 3600,
        "email_notifications": {}
      },
      {
        "task_key": "analyze_data",
        "description": "Perform an analysis of the data",
        "notebook_task": {
          "notebook_path": "/Users/user@databricks.com/analyze-data"
        },
        "depends_on": [
          {
            "task_key": "clean_data"
          }
        ],
        "existing_cluster_id": "1201-my-cluster",
        "max_retries": 3,
        "min_retry_interval_millis": 0,
        "retry_on_timeout": true,
        "timeout_seconds": 3600,
        "email_notifications": {}
      }
    ],
    "format": "MULTI_TASK"
  },
  "created_time": 1625841911296,
  "creator_user_name": "user@databricks.com",
  "run_as_user_name": "user@databricks.com"
}

Jobs API 2.1 supports configuration of task level clusters or one or more shared job clusters:

A task level cluster is created and started when a task starts and terminates when the task completes.
A shared job cluster allows multiple tasks in the same job to use the cluster. The cluster is created and started when the first task using the cluster starts and terminates after the last task using the cluster completes. A shared job cluster is not terminated when idle but terminates only after all tasks using it are complete. Multiple non-dependent tasks sharing a cluster can start at the same time. If a shared job cluster fails or is terminated before all tasks have finished, a new cluster is created.

To configure shared job clusters, include a JobCluster array in the JobSettings object. You can specify a maximum of 100 clusters per job. The following is an example of an API 2.1 response for a job configured with two shared clusters:

Note

If a task has library dependencies, you must configure the libraries in the task field settings; libraries cannot be configured in a shared job cluster configuration. In the following example, the libraries field in the configuration of the ingest_orders task demonstrates specification of a library dependency.

{
  "job_id": 53,
  "settings": {
    "name": "A job with multiple tasks",
    "email_notifications": {},
    "timeout_seconds": 0,
    "max_concurrent_runs": 1,
    "job_clusters": [
      {
        "job_cluster_key": "default_cluster",
        "new_cluster": {
          "spark_version": "7.3.x-scala2.12",
          "node_type_id": "i3.xlarge",
          "spark_conf": {
            "spark.speculation": true
          },
          "aws_attributes": {
            "availability": "SPOT",
            "zone_id": "us-west-2a"
          },
          "autoscale": {
            "min_workers": 2,
            "max_workers": 8
          }
        }
      },
      {
        "job_cluster_key": "data_processing_cluster",
        "new_cluster": {
          "spark_version": "7.3.x-scala2.12",
          "node_type_id": "r4.2xlarge",
          "spark_conf": {
            "spark.speculation": true
          },
          "aws_attributes": {
            "availability": "SPOT",
            "zone_id": "us-west-2a"
          },
          "autoscale": {
            "min_workers": 8,
            "max_workers": 16
          }
        }
      }
    ],
    "tasks": [
      {
        "task_key": "ingest_orders",
        "description": "Ingest order data",
        "depends_on": [ ],
        "job_cluster_key": "default_cluster",
        "spark_jar_task": {
          "main_class_name": "com.databricks.OrdersIngest",
          "parameters": [
            "--data",
            "dbfs:/path/to/order-data.json"
          ]
        },
        "libraries": [
          {
            "jar": "dbfs:/mnt/databricks/OrderIngest.jar"
          }
        ],
        "timeout_seconds": 86400,
        "max_retries": 3,
        "min_retry_interval_millis": 2000,
        "retry_on_timeout": false
      },
      {
        "task_key": "clean_orders",
        "description": "Clean and prepare the order data",
        "notebook_task": {
          "notebook_path": "/Users/user@databricks.com/clean-data"
        },
        "job_cluster_key": "default_cluster",
        "max_retries": 3,
        "min_retry_interval_millis": 0,
        "retry_on_timeout": true,
        "timeout_seconds": 3600,
        "email_notifications": {}
      },
      {
        "task_key": "analyze_orders",
        "description": "Perform an analysis of the order data",
        "notebook_task": {
          "notebook_path": "/Users/user@databricks.com/analyze-data"
        },
        "depends_on": [
          {
            "task_key": "clean_orders"
          }
        ],
        "job_cluster_key": "data_processing_cluster",
        "max_retries": 3,
        "min_retry_interval_millis": 0,
        "retry_on_timeout": true,
        "timeout_seconds": 3600,
        "email_notifications": {}
      }
    ],
    "format": "MULTI_TASK"
  },
  "created_time": 1625841911296,
  "creator_user_name": "user@databricks.com",
  "run_as_user_name": "user@databricks.com"
}
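
Because tasks refer to shared clusters and to each other by key, a mistyped key is easy to miss. The following is a minimal Python sketch, not part of the Jobs API, that checks a multi-task JobSettings dictionary for such mismatches before it is sent. The function name find_broken_references and the sample settings are hypothetical.

```python
from typing import Any


def find_broken_references(settings: dict[str, Any]) -> list[str]:
    """Return human-readable problems found in a multi-task JobSettings dict."""
    cluster_keys = {c["job_cluster_key"] for c in settings.get("job_clusters", [])}
    task_keys = {t["task_key"] for t in settings.get("tasks", [])}
    problems = []
    for task in settings.get("tasks", []):
        # A task may name a shared job cluster; the key must be defined in job_clusters.
        key = task.get("job_cluster_key")
        if key is not None and key not in cluster_keys:
            problems.append(f"{task['task_key']}: unknown job_cluster_key {key!r}")
        # Every dependency must name another task in the same job.
        for dep in task.get("depends_on", []):
            if dep["task_key"] not in task_keys:
                problems.append(f"{task['task_key']}: unknown dependency {dep['task_key']!r}")
    return problems


# Hypothetical settings with one deliberate mistake (missing_cluster is not defined).
settings = {
    "job_clusters": [{"job_cluster_key": "default_cluster", "new_cluster": {}}],
    "tasks": [
        {"task_key": "clean_orders", "job_cluster_key": "default_cluster"},
        {
            "task_key": "analyze_orders",
            "job_cluster_key": "missing_cluster",
            "depends_on": [{"task_key": "clean_orders"}],
        },
    ],
}
print(find_broken_references(settings))
```

An empty list means every job_cluster_key and depends_on entry resolves to a key defined in the same settings object.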

For single-task format jobs, the JobSettings data structure remains unchanged except for the addition of the format field. No TaskSettings array is included, and the task settings remain defined at the top level of the JobSettings data structure. You will not need to make changes to your existing API clients to process single-task format jobs.

An example JSON document representing a single-task format job for API 2.0:

{
  "job_id": 27,
  "settings": {
    "name": "Example notebook",
    "existing_cluster_id": "1201-my-cluster",
    "libraries": [
      {
        "jar": "dbfs:/FileStore/jars/spark_examples.jar"
      }
    ],
    "email_notifications": {},
    "timeout_seconds": 0,
    "schedule": {
      "quartz_cron_expression": "0 0 0 * * ?",
      "timezone_id": "US/Pacific",
      "pause_status": "UNPAUSED"
    },
    "notebook_task": {
      "notebook_path": "/notebooks/example-notebook",
      "revision_timestamp": 0
    },
    "max_concurrent_runs": 1,
    "format": "SINGLE_TASK"
  },
  "created_time": 1504128821443,
  "creator_user_name": "user@databricks.com"
}
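
A client that must handle both formats can branch on the format field. The following Python sketch shows one possible approach: it returns a list of task dictionaries for either format. The helper name tasks_of and the TASK_LEVEL_FIELDS tuple are assumptions made for illustration, and the tuple is a partial list rather than the full API schema.

```python
from typing import Any

# Fields that hold task settings at the top level of single-task JobSettings.
# Illustrative subset only; extend it to match the fields your jobs use.
TASK_LEVEL_FIELDS = (
    "notebook_task", "spark_jar_task", "existing_cluster_id", "new_cluster",
    "libraries", "max_retries", "min_retry_interval_millis", "retry_on_timeout",
)


def tasks_of(job: dict[str, Any]) -> list[dict[str, Any]]:
    """Return the job's tasks as a list, whichever format the job uses."""
    settings = job["settings"]
    if settings.get("format") == "MULTI_TASK":
        return list(settings.get("tasks", []))
    # Single-task format: task settings live directly in JobSettings.
    return [{k: settings[k] for k in TASK_LEVEL_FIELDS if k in settings}]


single = {
    "job_id": 27,
    "settings": {
        "name": "Example notebook",
        "existing_cluster_id": "1201-my-cluster",
        "notebook_task": {"notebook_path": "/notebooks/example-notebook"},
        "format": "SINGLE_TASK",
    },
}
multi = {
    "job_id": 53,
    "settings": {
        "name": "A job with multiple tasks",
        "tasks": [{"task_key": "clean_data"}, {"task_key": "analyze_data"}],
        "format": "MULTI_TASK",
    },
}
print(len(tasks_of(single)), len(tasks_of(multi)))
```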

API client guide

This section provides guidelines, examples, and required changes for API calls affected by the new multi-task format feature.

Create

To create a single-task format job through the Create a new job operation (POST /jobs/create) in the Jobs API, you do not need to change existing clients.

To create a multi-task format job, use the tasks field in JobSettings to specify settings for each task. The following example creates a job with two notebook tasks. This example is for API 2.0 and 2.1:

Note

A maximum of 100 tasks can be specified per job.

{
  "name": "Multi-task-job",
  "max_concurrent_runs": 1,
  "tasks": [
    {
      "task_key": "clean_data",
      "description": "Clean and prepare the data",
      "notebook_task": {
        "notebook_path": "/Users/user@databricks.com/clean-data"
      },
      "existing_cluster_id": "1201-my-cluster",
      "timeout_seconds": 3600,
      "max_retries": 3,
      "retry_on_timeout": true
    },
    {
      "task_key": "analyze_data",
      "description": "Perform an analysis of the data",
      "notebook_task": {
        "notebook_path": "/Users/user@databricks.com/analyze-data"
      },
      "depends_on": [
        {
          "task_key": "clean_data"
        }
      ],
      "existing_cluster_id": "1201-my-cluster",
      "timeout_seconds": 3600,
      "max_retries": 3,
      "retry_on_timeout": true
    }
  ]
}
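
As a rough illustration of how a client might send this payload, the following Python sketch builds (but does not send) a POST request for /api/2.1/jobs/create using only the standard library. The function name build_create_request, the workspace URL, and the token value are placeholders, and the pre-flight checks are a convenience assumed here, not behavior of the API itself.

```python
import json
import urllib.request
from typing import Any

MAX_TASKS_PER_JOB = 100  # limit stated in the note in this section


def build_create_request(host: str, token: str, settings: dict[str, Any]) -> urllib.request.Request:
    """Build a POST request for the create operation without sending it."""
    tasks = settings.get("tasks", [])
    if len(tasks) > MAX_TASKS_PER_JOB:
        raise ValueError(f"a job can specify at most {MAX_TASKS_PER_JOB} tasks")
    keys = [t["task_key"] for t in tasks]
    if len(keys) != len(set(keys)):
        raise ValueError("task_key values must be unique within a job")
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.1/jobs/create",
        data=json.dumps(settings).encode("utf-8"),
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
        method="POST",
    )


# Placeholder host and token; substitute your own workspace URL and credentials.
request = build_create_request(
    "https://example.invalid",
    "PLACEHOLDER_TOKEN",
    {"name": "Multi-task-job", "tasks": [{"task_key": "clean_data"}, {"task_key": "analyze_data"}]},
)
print(request.get_method(), request.full_url)
```

Passing the returned object to urllib.request.urlopen would send the request; that step is left out so the sketch runs without a workspace.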

Runs submit

To submit a one-time run of a single-task format job with the Create and trigger a one-time run operation (POST /runs/submit) in the Jobs API, you do not need to change existing clients.

To submit a one-time run of a multi-task format job, use the tasks field in JobSettings to specify settings for each task, including clusters. Clusters must be set at the task level when submitting a multi-task format job because the runs submit request does not support shared job clusters. See Create for an example JobSettings specifying multiple tasks.
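
The cluster restriction can be checked on the client before the request is sent. The following Python sketch, under the assumption that each task carries either existing_cluster_id or new_cluster, reports tasks that would violate it. The function name submit_cluster_problems is hypothetical.

```python
from typing import Any


def submit_cluster_problems(run_settings: dict[str, Any]) -> list[str]:
    """List tasks whose cluster settings are unsuitable for a runs submit request."""
    problems = []
    for task in run_settings.get("tasks", []):
        key = task.get("task_key", "<no task_key>")
        if "job_cluster_key" in task:
            # Shared job clusters are not supported by runs submit, per the text above.
            problems.append(f"{key}: uses a shared job cluster")
        elif "existing_cluster_id" not in task and "new_cluster" not in task:
            problems.append(f"{key}: no task-level cluster configured")
    return problems


run_settings = {
    "run_name": "one-time run",
    "tasks": [
        {"task_key": "clean_data", "existing_cluster_id": "1201-my-cluster"},
        {"task_key": "analyze_data", "job_cluster_key": "default_cluster"},
    ],
}
print(submit_cluster_problems(run_settings))
```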

Update

To update a single-task format job with the Partially update a job operation (POST /jobs/update) in the Jobs API, you do not need to change existing clients.

To update the settings of a multi-task format job, you must use the unique task_key field to identify new task settings. See Create for an example JobSettings specifying multiple tasks.
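
As an illustration, the following Python sketch assembles an update request body in which every changed task is identified by its task_key. The helper name build_update_payload is hypothetical, and how the service combines the supplied tasks with the existing ones is not covered by this sketch; consult the API reference for the exact behavior.

```python
from typing import Any


def build_update_payload(job_id: int, task_changes: list[dict[str, Any]]) -> dict[str, Any]:
    """Build the body for POST /jobs/update, requiring a task_key on every task."""
    for task in task_changes:
        if not task.get("task_key"):
            raise ValueError("every task in an update must include its task_key")
    return {"job_id": job_id, "new_settings": {"tasks": task_changes}}


# Change the retry count of one task in job 53 (values are illustrative).
payload = build_update_payload(53, [{"task_key": "clean_data", "max_retries": 5}])
print(payload)
```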

Reset

To overwrite the settings of a single-task format job with the Overwrite all settings for a job operation (POST /jobs/reset) in the Jobs API, you do not need to change existing clients.

To overwrite the settings of a multi-task format job, specify a JobSettings data structure with an array of TaskSettings data structures. See Create for an example JobSettings specifying multiple tasks.

Use Update to change individual fields without switching from single-task to multi-task format.

List

For single-task format jobs, no client changes are required to process the response from the List all jobs operation (GET /jobs/list) in the Jobs API.

For multi-task format jobs, most settings are defined at the task level and not the job level. Cluster configuration may be set at the task or job level. To modify clients to access cluster or task settings for a multi-task format job returned in the Job structure:

1. Parse the job_id field for the multi-task format job.
2. Pass the job_id to the Get a job operation (GET /jobs/get) in the Jobs API to retrieve job details. See Get for an example response from the Get API call for a multi-task format job.

The following example shows a response containing single-task and multi-task format jobs. This example is for API 2.0:

{
  "jobs": [
    {
      "job_id": 36,
      "settings": {
        "name": "A job with a single task",
        "existing_cluster_id": "1201-my-cluster",
        "email_notifications": {},
        "timeout_seconds": 0,
        "notebook_task": {
          "notebook_path": "/Users/user@databricks.com/example-notebook",
          "revision_timestamp": 0
        },
        "max_concurrent_runs": 1,
        "format": "SINGLE_TASK"
      },
      "created_time": 1505427148390,
      "creator_user_name": "user@databricks.com"
    },
    {
      "job_id": 53,
      "settings": {
        "name": "A job with multiple tasks",
        "email_notifications": {},
        "timeout_seconds": 0,
        "max_concurrent_runs": 1,
        "format": "MULTI_TASK"
      },
      "created_time": 1625841911296,
      "creator_user_name": "user@databricks.com"
    }
  ]
}
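
The first step, picking out the job_id of each multi-task format job, might look like the following Python sketch. The function name multi_task_job_ids is an assumption, and the sample response is trimmed to the fields the function reads.

```python
from typing import Any


def multi_task_job_ids(list_response: dict[str, Any]) -> list[int]:
    """Return the job_id of every job in a list response whose format is MULTI_TASK."""
    return [
        job["job_id"]
        for job in list_response.get("jobs", [])
        if job.get("settings", {}).get("format") == "MULTI_TASK"
    ]


list_response = {
    "jobs": [
        {"job_id": 36, "settings": {"name": "A job with a single task", "format": "SINGLE_TASK"}},
        {"job_id": 53, "settings": {"name": "A job with multiple tasks", "format": "MULTI_TASK"}},
    ]
}
print(multi_task_job_ids(list_response))
```

Each returned job_id can then be passed to the Get a job operation to retrieve the task settings.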

Get

For single-task format jobs, no client changes are required to process the response from the Get a job operation (GET /jobs/get) in the Jobs API.

Multi-task format jobs return an array of task data structures containing task settings. If you require access to task level details, you need to modify your clients to iterate through the tasks array and extract required fields.

The following shows an example response from the Get API call for a multi-task format job. This example is for API 2.0 and 2.1:

{
  "job_id": 53,
  "settings": {
    "name": "A job with multiple tasks",
    "email_notifications": {},
    "timeout_seconds": 0,
    "max_concurrent_runs": 1,
    "tasks": [
      {
        "task_key": "clean_data",
        "description": "Clean and prepare the data",
        "notebook_task": {
          "notebook_path": "/Users/user@databricks.com/clean-data"
        },
        "existing_cluster_id": "1201-my-cluster",
        "max_retries": 3,
        "min_retry_interval_millis": 0,
        "retry_on_timeout": true,
        "timeout_seconds": 3600,
        "email_notifications": {}
      },
      {
        "task_key": "analyze_data",
        "description": "Perform an analysis of the data",
        "notebook_task": {
          "notebook_path": "/Users/user@databricks.com/analyze-data"
        },
        "depends_on": [
          {
            "task_key": "clean_data"
          }
        ],
        "existing_cluster_id": "1201-my-cluster",
        "max_retries": 3,
        "min_retry_interval_millis": 0,
        "retry_on_timeout": true,
        "timeout_seconds": 3600,
        "email_notifications": {}
      }
    ],
    "format": "MULTI_TASK"
  },
  "created_time": 1625841911296,
  "creator_user_name": "user@databricks.com",
  "run_as_user_name": "user@databricks.com"
}
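
Iterating through the tasks array can be as simple as the following Python sketch, which extracts each task's dependencies into a dictionary keyed by task_key. The function name task_dependencies is hypothetical; the same loop can extract any other task-level field.

```python
from typing import Any


def task_dependencies(job: dict[str, Any]) -> dict[str, list[str]]:
    """Map each task_key to the task_key values it depends on."""
    return {
        task["task_key"]: [dep["task_key"] for dep in task.get("depends_on", [])]
        for task in job["settings"].get("tasks", [])
    }


job = {
    "job_id": 53,
    "settings": {
        "tasks": [
            {"task_key": "clean_data"},
            {"task_key": "analyze_data", "depends_on": [{"task_key": "clean_data"}]},
        ],
        "format": "MULTI_TASK",
    },
}
print(task_dependencies(job))
```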

Runs get

For single-task format jobs, no client changes are required to process the response from the Get a job run operation (GET /jobs/runs/get) in the Jobs API.

The response for a multi-task format job run contains an array of TaskSettings. To retrieve run results for each task:

1. Iterate through each of the tasks.
2. Parse the run_id for each task.
3. Call the Get the output for a run operation (GET /jobs/runs/get-output) with the run_id to get details on the run for each task.

The following is an example response from the Get a job run request for a multi-task format job:

{
  "job_id": 53,
  "run_id": 759600,
  "number_in_job": 7,
  "original_attempt_run_id": 759600,
  "state": {
    "life_cycle_state": "TERMINATED",
    "result_state": "SUCCESS",
    "state_message": ""
  },
  "cluster_spec": {},
  "start_time": 1595943854860,
  "setup_duration": 0,
  "execution_duration": 0,
  "cleanup_duration": 0,
  "trigger": "ONE_TIME",
  "creator_user_name": "user@databricks.com",
  "run_name": "Query logs",
  "run_type": "JOB_RUN",
  "tasks": [
    {
      "run_id": 759601,
      "task_key": "query-logs",
      "description": "Query session logs",
      "notebook_task": {
        "notebook_path": "/Users/user@databricks.com/log-query"
      },
      "existing_cluster_id": "1201-my-cluster",
      "state": {
        "life_cycle_state": "TERMINATED",
        "result_state": "SUCCESS",
        "state_message": ""
      }
    },
    {
      "run_id": 759602,
      "task_key": "validate_output",
      "description": "Validate query output",
      "depends_on": [
        {
          "task_key": "query-logs"
        }
      ],
      "notebook_task": {
        "notebook_path": "/Users/user@databricks.com/validate-query-results"
      },
      "existing_cluster_id": "1201-my-cluster",
      "state": {
        "life_cycle_state": "TERMINATED",
        "result_state": "SUCCESS",
        "state_message": ""
      }
    }
  ],
  "format": "MULTI_TASK"
}
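
A client might summarize such a response with the following Python sketch, which pairs each task_key with its run_id and result state. The function name task_run_states is an assumption, and the sample run is trimmed to the fields the function reads.

```python
from typing import Any, Optional


def task_run_states(run: dict[str, Any]) -> dict[str, tuple[int, Optional[str]]]:
    """Map each task_key to (run_id, result_state); result_state is None if absent."""
    return {
        task["task_key"]: (task["run_id"], task.get("state", {}).get("result_state"))
        for task in run.get("tasks", [])
    }


run = {
    "run_id": 759600,
    "tasks": [
        {"run_id": 759601, "task_key": "query-logs", "state": {"result_state": "SUCCESS"}},
        {"run_id": 759602, "task_key": "validate_output", "state": {"life_cycle_state": "RUNNING"}},
    ],
}
print(task_run_states(run))
```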

Runs get output

For single-task format jobs, no client changes are required to process the response from the Get the output for a run operation (GET /jobs/runs/get-output) in the Jobs API.

For multi-task format jobs, calling Runs get output on a parent run results in an error since run output is available only for individual tasks. To get the output and metadata for a multi-task format job:

1. Call the Get a job run operation (GET /jobs/runs/get) for the parent run.
2. Iterate over the child run_id fields in the response.
3. Use the child run_id values to call Runs get output.
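
This sequence can be sketched in Python as follows: fetch the parent run with Get a job run, then request output once per child run_id. The two callables get_run and get_run_output are hypothetical stand-ins for whatever HTTP client code you already have; they are stubbed here so the example runs without a workspace.

```python
from typing import Any, Callable

Json = dict[str, Any]


def collect_task_outputs(
    parent_run_id: int,
    get_run: Callable[[int], Json],
    get_run_output: Callable[[int], Json],
) -> dict[str, Json]:
    """Return the output of each child task run, keyed by task_key."""
    parent = get_run(parent_run_id)  # wraps GET /jobs/runs/get
    return {
        task["task_key"]: get_run_output(task["run_id"])  # wraps GET /jobs/runs/get-output
        for task in parent.get("tasks", [])
    }


# Stubs with made-up data in place of real API calls.
def fake_get_run(run_id: int) -> Json:
    return {"run_id": run_id, "tasks": [{"run_id": 759601, "task_key": "query-logs"}]}


def fake_get_run_output(run_id: int) -> Json:
    return {"metadata": {"run_id": run_id}}


outputs = collect_task_outputs(759600, fake_get_run, fake_get_run_output)
print(outputs)
```

Passing the API calls in as parameters keeps the traversal logic testable; it is a design choice for this sketch, not a requirement of the API.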

Runs list

For single-task format jobs, no client changes are required to process the response from the List runs for a job operation (GET /jobs/runs/list).

For multi-task format jobs, an empty tasks array is returned. Pass the run_id to the Get a job run operation (GET /jobs/runs/get) to retrieve the tasks. The following shows an example response from the Runs list API call for a multi-task format job:

{
  "runs": [
    {
      "job_id": 53,
      "run_id": 759600,
      "number_in_job": 7,
      "original_attempt_run_id": 759600,
      "state": {
        "life_cycle_state": "TERMINATED",
        "result_state": "SUCCESS",
        "state_message": ""
      },
      "cluster_spec": {},
      "start_time": 1595943854860,
      "setup_duration": 0,
      "execution_duration": 0,
      "cleanup_duration": 0,
      "trigger": "ONE_TIME",
      "creator_user_name": "user@databricks.com",
      "run_name": "Query logs",
      "run_type": "JOB_RUN",
      "tasks": [],
      "format": "MULTI_TASK"
    }
  ],
  "has_more": false
}
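
One way to fill in the missing tasks is sketched below in Python: for every multi-task format run with an empty tasks array, the run is replaced by the result of a Get a job run call. The function name runs_with_tasks and the get_run callable are hypothetical; get_run stands in for your own client code and is stubbed here.

```python
from typing import Any, Callable

Json = dict[str, Any]


def runs_with_tasks(runs_list_response: Json, get_run: Callable[[int], Json]) -> list[Json]:
    """Return the listed runs, fetching full details for multi-task runs without tasks."""
    detailed = []
    for run in runs_list_response.get("runs", []):
        if run.get("format") == "MULTI_TASK" and not run.get("tasks"):
            run = get_run(run["run_id"])  # wraps GET /jobs/runs/get
        detailed.append(run)
    return detailed


# Stub with made-up data in place of a real API call.
def fake_get_run(run_id: int) -> Json:
    return {"run_id": run_id, "format": "MULTI_TASK", "tasks": [{"run_id": run_id + 1, "task_key": "query-logs"}]}


listing = {"runs": [{"run_id": 759600, "tasks": [], "format": "MULTI_TASK"}], "has_more": False}
print(runs_with_tasks(listing, fake_get_run))
```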
