Understanding Job and Task States
In HPC Pack, jobs and tasks have almost identical life cycle states. The main life cycle states are Configuring, Queued, Running, Finished, Failed, and Canceled. Jobs and tasks also move through brief transitional states. The following table summarizes all life cycle states.
Job and task states
State | Definition |
---|---|
Configuring | The job or task is in the system, but has not been submitted to the queue. |
Submitted | The job or task has been submitted and is awaiting validation before it can be queued. |
Validating | The HPC Job Scheduler Service is validating the job or task. During validation, the HPC Job Scheduler Service confirms permissions, applies default settings for any properties that the job owner did not specify, and validates each property against constraints. Default settings and constraints are defined by the job template. For more information about job templates, see Understanding Job Templates - Job Manager. The HPC Job Scheduler Service also confirms that job properties encompass all task properties (for example, that no task has a run time greater than the run time of the job). During validation, the job might also pass through a custom submission filter application that is defined by the cluster administrator. If the job passes validation, it moves to the Queued state. If the job fails validation, it displays an error message and moves to the Failed state. |
Queued | The job or task passed validation, and is waiting to be scheduled and activated (run). When a running job, a Basic task, or a Parametric Sweep sub-task is preempted by the HPC Job Scheduler Service, it moves back to the Queued state (unless the task is not rerunnable, in which case it is marked as Failed). Note: In HPC Pack 2012, the default option for preemption behavior in Queued scheduling mode is task-level immediate preemption, rather than job-level preemption. |
Dispatching | This state only applies to tasks. The HPC Job Scheduler Service has allocated resources to the task and is contacting the allocated nodes to start running the task. When the task starts, it moves to the Running state. |
Running | The job or task is running on one or more nodes. |
Finishing | The job or task completed, and job or task clean-up is in progress. |
Finished | The job or task completed successfully. |
Failed | The job or task failed to complete, stopped running, or returned an exit code that indicates failure (by default, any non-zero exit code). Additionally, a running task is marked as Failed when: (1) the job owner or a cluster administrator cancels the task; (2) the HPC Job Scheduler Service cancels the task because it has exceeded its maximum run time; (3) the HPC Job Scheduler Service preempts a task that is not marked as rerunnable; or (4) the HPC Job Scheduler Service preempts a sub-task that is started on a per-resource basis (Node Preparation, Node Release, and Service sub-tasks). If a job or task fails to start because of a cluster failure, it is automatically retried a specified number of times before it is marked as Failed. |
Canceling | The job or task was canceled and clean-up is in progress. |
Canceled | For a job: the job was canceled by the job owner, a cluster administrator, or the HPC Job Scheduler Service. For example, the HPC Job Scheduler Service can cancel a job if it exceeds its run time or if it is preempted. For a task: the task was canceled by the job owner or a cluster administrator before it started running. If a running task is canceled, the task is marked as Failed instead. To cancel a job or task, see Cancel a Job or Task - Job Manager or Force Cancel a Job or Task - Job Manager. |
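
The branching rules in the table can be summarized in a short sketch. This is an illustrative model only, written in Python, and it does not use or reproduce the HPC Pack API. The `State` enumeration and the helper functions (`after_validation`, `task_after_exit`, `task_after_preemption`, `task_after_cancel`) are hypothetical names introduced here. Each function encodes one rule stated in the table and assumes default settings, such as any non-zero exit code indicating failure.

```python
from enum import Enum


class State(Enum):
    """Life cycle states from the table (hypothetical model, not the HPC Pack API)."""
    CONFIGURING = "Configuring"
    SUBMITTED = "Submitted"
    VALIDATING = "Validating"
    QUEUED = "Queued"
    DISPATCHING = "Dispatching"  # applies to tasks only
    RUNNING = "Running"
    FINISHING = "Finishing"
    FINISHED = "Finished"
    FAILED = "Failed"
    CANCELING = "Canceling"
    CANCELED = "Canceled"


def after_validation(passed: bool) -> State:
    """A job that passes validation is queued; otherwise it moves to Failed."""
    return State.QUEUED if passed else State.FAILED


def task_after_exit(exit_code: int) -> State:
    """Assumes the default rule: any non-zero exit code indicates failure."""
    return State.FINISHED if exit_code == 0 else State.FAILED


def task_after_preemption(rerunnable: bool, per_resource_subtask: bool = False) -> State:
    """A preempted task returns to Queued unless it is not rerunnable or is a
    per-resource sub-task (Node Preparation, Node Release, Service)."""
    if per_resource_subtask or not rerunnable:
        return State.FAILED
    return State.QUEUED


def task_after_cancel(started_running: bool) -> State:
    """A task canceled before it starts running ends as Canceled;
    a task canceled while running is marked as Failed."""
    return State.FAILED if started_running else State.CANCELED
```

For example, under this model `task_after_cancel(started_running=True)` yields `State.FAILED`, which matches the table's note that a canceled running task is reported as Failed rather than Canceled.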