Understanding Node States, Health, and Operations
Applies To: Windows HPC Server 2008
Node state reflects a node’s deployment state, and whether or not an administrator wants the node to be available as a resource for cluster jobs. An administrator brings a node Online to indicate that the node should accept jobs or client requests.
Node health indicates whether the HPC services are aware of any warnings or errors on that node. If the node has a node health value of Unreachable or Provisioning Failed, the node cannot accept jobs or client requests, even if the node state is Online.
During normal operations, compute nodes and Windows Communication Foundation (WCF) broker nodes have a node health value of OK, and a node state value of Online.
During normal operations, the head node has a node health value of OK, and a node state value of Offline. If the head node is also acting as a compute node or a WCF broker node, or if the head node is installed for high availability, then its normal node state value is Online.
Part of the process of monitoring and maintaining cluster health is finding deviations from the normal node state and health, and monitoring the state of cluster operations.
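The normal values described above can be written down as a small lookup for spotting nodes that differ from them. The following Python sketch is illustrative only: the `Node` record, its field names, and the sample node names are hypothetical and are not part of any HPC API. Only the state and health values come from this topic.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical record of what an administrator sees for one node."""
    name: str
    state: str                       # for example "Online" or "Offline"
    health: str                      # for example "OK" or "Unreachable"
    is_head_node: bool = False
    runs_jobs: bool = False          # head node also acts as a compute or WCF broker node
    high_availability: bool = False  # head node installed for high availability


def expected_state(node):
    """Return the node state that this topic describes as normal for the node's role."""
    if node.is_head_node and not (node.runs_jobs or node.high_availability):
        return "Offline"
    return "Online"


def find_deviations(nodes):
    """Return (name, reason) pairs for nodes that differ from normal operation."""
    deviations = []
    for node in nodes:
        if node.health != "OK":
            deviations.append((node.name, f"health is {node.health}"))
        if node.state != expected_state(node):
            deviations.append(
                (node.name, f"state is {node.state}, expected {expected_state(node)}")
            )
    return deviations


# Made-up sample data for the sketch.
cluster = [
    Node("HEAD01", "Offline", "OK", is_head_node=True),
    Node("NODE01", "Online", "OK"),
    Node("NODE02", "Online", "Unreachable"),
    Node("NODE03", "Draining", "OK"),
]
for name, reason in find_deviations(cluster):
    print(f"{name}: {reason}")
```

In this sample, the head node and NODE01 match the normal values, while NODE02 and NODE03 are reported.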
The sections in this topic describe the available values for:
Node states
Node health
Operation states
Node states
Node states reflect a node’s deployment state, and whether or not an administrator wants the node to be available as a resource for cluster jobs.
When the head node first detects a node on the network, the node appears in the Unknown state. When an administrator adds a node to the cluster by applying a node template, the node moves to the Provisioning state. When the node has successfully joined the cluster, it moves to the Offline state.
An administrator brings a node Online or takes a node Offline to indicate whether or not the node should accept and run cluster jobs. The HPC Job Scheduler Service will only try to start new jobs on nodes that are in the Online state. To make a node unavailable for new jobs, administrators can take the node Offline. Nodes must be in the Offline state to run some management actions, such as Reimage or Maintain.
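The two rules in the preceding paragraph, combined with the node health rule from the introduction, can be sketched as simple predicates. This is a minimal Python illustration, not HPC code: the function names are hypothetical, the health value names follow the node health values table in this topic, and the treatment of actions other than Reimage and Maintain is an assumption of the sketch.

```python
# Health values that prevent jobs from starting, named as in the
# node health values table in this topic.
BLOCKING_HEALTH = {"Unreachable", "Provisioning Failed"}

# Management actions that this topic names as requiring the Offline state.
OFFLINE_ONLY_ACTIONS = {"Reimage", "Maintain"}


def can_start_new_jobs(state, health):
    """New jobs start only on nodes that are Online and not in a blocking health state."""
    return state == "Online" and health not in BLOCKING_HEALTH


def can_run_action(action, state):
    """Reimage and Maintain need the Offline state.

    Assumption for this sketch: any other action is treated as unrestricted.
    """
    if action in OFFLINE_ONLY_ACTIONS:
        return state == "Offline"
    return True


print(can_start_new_jobs("Online", "OK"))
print(can_start_new_jobs("Online", "Unreachable"))
print(can_run_action("Reimage", "Online"))
```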
You can use the node List view to display the state of each node and filter compute nodes by node state. For more information about the node List view, see Understanding Node List and Heat Map Views.
The following table describes node state values:
Node State | Description |
---|---|
Online | This state indicates that the node should accept and run cluster jobs. The HPC Job Scheduler Service will only try to start new jobs on nodes that are in the Online state. A node must be in the Online state and healthy to run jobs. If the node health is Unreachable or Provisioning Failed, jobs cannot start on that node. The Online state is the normal operating node state for compute nodes, for WCF broker nodes, and for head nodes that also act as a compute node or WCF broker node or that are installed for high availability. |
Offline | This state allows a cluster administrator to run scripts, install software, and perform other tasks on the node. This is the default state of a compute or WCF broker node after a cluster administrator has approved the node for inclusion in the cluster. This is also the default state for a head node (unless it is installed for high availability). If a node is taken offline while running jobs, it will first move through the Draining state. If an administrator chooses to force the node offline immediately, any running tasks will be canceled and requeued within their job. |
Unknown | This state indicates that the node is not part of the cluster, or that a provisioning operation has failed on that node. To join a node to the cluster, apply the Assign Node Template action to the node. In a high availability cluster, after setup is run on the first head node, the second head node will be in the Unknown state until setup is run on that node. After setup, the second head node moves to the Online state. |
Provisioning | This state indicates that the node is being configured as a compute node. The Assign Node Template, Reimage, and Maintain actions also put a node into the Provisioning state. After provisioning is complete, the node goes to the Offline state. |
Starting | This state indicates that the node is transitioning from the Offline state to the Online state. |
Draining | This state indicates that the compute node has been taken offline and is transitioning to the Offline state. The node completes currently running jobs before going to the Offline state. Draining nodes do not accept new jobs. |
Removing | This state indicates that information about the node is being removed from the HPC Node Management Services database on the head node. Nothing is changed on the deleted node itself. The Delete action puts a node into this state. If the node tries to rejoin the cluster, a new entry will be created for that node in the database, and the node will appear in the Unknown state. |
Rejected | This state indicates that the node was rejected by a cluster administrator. A node in the Rejected state cannot join the cluster. |
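The state changes that this section describes can be summarized as a transition table. The Python sketch below is only a reading aid: the event names are hypothetical labels invented for the sketch, and the Removing and Rejected states are left out because this topic does not say which states a node can enter them from.

```python
# Transitions described in this topic, keyed by (current state, event).
# Event names are hypothetical labels chosen for this sketch.
TRANSITIONS = {
    ("Unknown", "assign_node_template"): "Provisioning",
    ("Provisioning", "provisioning_succeeded"): "Offline",
    ("Provisioning", "provisioning_failed"): "Unknown",
    ("Offline", "bring_online"): "Starting",
    ("Starting", "started"): "Online",
    ("Online", "take_offline"): "Draining",
    ("Online", "force_offline"): "Offline",
    ("Draining", "jobs_finished"): "Offline",
    ("Offline", "reimage"): "Provisioning",
    ("Offline", "maintain"): "Provisioning",
}


def next_state(state, event):
    """Return the state that follows `state` when `event` occurs."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition for event {event!r} in state {state!r}") from None


def replay(events, state="Unknown"):
    """Apply events in order and return every state visited, starting state included."""
    path = [state]
    for event in events:
        state = next_state(state, event)
        path.append(state)
    return path


lifecycle = [
    "assign_node_template",
    "provisioning_succeeded",
    "bring_online",
    "started",
    "take_offline",
    "jobs_finished",
]
print(" -> ".join(replay(lifecycle)))
```

Requesting an event that the table does not cover, such as `reimage` on an Online node, raises `ValueError`, which mirrors the requirement that a node be Offline for that action.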
Node health
Node health indicates whether the HPC services are aware of any warnings or errors on that node. Each node health value other than OK is associated with an icon that appears next to the node name in the node List view. When node health is OK, no icon appears next to the node name.
The following table lists the icons, and the alert types and health states that are indicated by the icons.
Icon | Alert type | Associated health state |
---|---|---|
(icon not shown) | Error | Unreachable, Provisioning Failed |
(icon not shown) | Warning | Diagnostics Failed |
(icon not shown) | Pending Operation | Ongoing Operation |
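The same mapping can be kept as a lookup when summarizing node health in a script. This is a minimal Python sketch with a hypothetical function name; the health value names follow the node health values table in this topic.

```python
# Alert type for each health value that shows an icon in the node List view.
ALERT_TYPE_BY_HEALTH = {
    "Unreachable": "Error",
    "Provisioning Failed": "Error",
    "Diagnostics Failed": "Warning",
    "Ongoing Operation": "Pending Operation",
}


def alert_type(health):
    """Return the alert type for a health value, or None for OK (no icon is shown)."""
    if health == "OK":
        return None
    # An unrecognized health value raises KeyError instead of being hidden.
    return ALERT_TYPE_BY_HEALTH[health]


for value in ("OK", "Unreachable", "Diagnostics Failed"):
    print(value, "->", alert_type(value))
```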
You can use the node List view to display the health of each compute node and filter nodes by node health. For more information about the node List view, see Understanding Node List and Heat Map Views.
The following table describes node health values:
Node Health | Description |
---|---|
OK | This value indicates that the HPC services are not aware of any problems with the node. |
Unreachable | This value indicates that the HPC Job Scheduler Service cannot contact the node. The HPC Job Scheduler Service sends regular health probes to the HPC Node Manager Service on each node. If a compute node does not reply to the health probe, it has missed a heartbeat. If a node misses too many heartbeats, the HPC Job Scheduler Service flags the node as Unreachable. HPC Job Scheduler property settings apply to the health probes. A node can become unreachable for many reasons. For troubleshooting information, see Node Health is Flagged as Unreachable. |
Ongoing Operation | This value indicates that the node is performing an operation that a cluster administrator initiated. If the operation can be performed while the node is Online, then the HPC Job Scheduler Service can continue to run jobs on the node. |
Diagnostics Failed | This value indicates that a cluster administrator ran diagnostic tests on the node, and one or more tests returned a result of Failure or Failed to Run. The node leaves this health state if: |
Provisioning Failed | This value indicates that the most recent provisioning operation failed. The node will also be in the Unknown state at this point. |
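The heartbeat rule in the Unreachable row can be illustrated with a small counter. The Python sketch below rests on stated assumptions: the class name, the `max_missed` threshold and its value of 3, and the choice to count consecutive misses and reset on any reply are all hypothetical. The actual limits are set by HPC Job Scheduler properties that are not listed here.

```python
class HeartbeatTracker:
    """Counts consecutive missed health probes for one node.

    max_missed is a hypothetical threshold for this sketch; the real limit
    is controlled by HPC Job Scheduler property settings.
    """

    def __init__(self, max_missed=3):
        self.max_missed = max_missed
        self.missed = 0

    def record_probe(self, replied):
        """Record one probe result and return "Unreachable" or "OK".

        Assumption: a reply resets the count of missed heartbeats.
        Other health values are ignored in this sketch.
        """
        if replied:
            self.missed = 0
        else:
            self.missed += 1
        return "Unreachable" if self.missed >= self.max_missed else "OK"


tracker = HeartbeatTracker(max_missed=3)
replies = [True, False, False, True, False, False, False]
results = [tracker.record_probe(replied) for replied in replies]
print(results)
```

With these inputs, only the last probe result is Unreachable, because the reply in the middle resets the count before the threshold is reached.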
Operation states
For information about how to view the operations log, see Read the Operations Log.
The following table describes the operation state values:
Operation State | Description |
---|---|
Archived | This state indicates that the operation is more than 24 hours old or the diagnostic test has been cleared. When an operation is archived, it is removed from other status reports. |
Committed | This state indicates that the operation completed successfully. |
Executing | This state indicates that the operation is in progress. |
Failed | This state indicates that the operation failed to execute, is being reverted, or failed to revert. |
Reverted | This state indicates that the operation reverted after failure or cancellation. |
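The Archived rule in the table above can be sketched as a single check. This Python example is an illustration under assumptions: the function and parameter names are hypothetical, and the topic does not say which timestamp the 24 hours is measured from, so the sketch simply takes one operation time as input.

```python
from datetime import datetime, timedelta

# Age after which this topic describes an operation as Archived.
ARCHIVE_AGE = timedelta(hours=24)


def should_archive(operation_time, now, diagnostic_cleared=False):
    """Return True if the operation meets either Archived condition.

    operation_time is a hypothetical timestamp for the operation; which
    timestamp the product actually uses is not stated in this topic.
    """
    return diagnostic_cleared or (now - operation_time) > ARCHIVE_AGE


now = datetime(2024, 1, 2, 12, 0)
print(should_archive(datetime(2024, 1, 1, 11, 0), now))   # 25 hours old
print(should_archive(datetime(2024, 1, 2, 9, 0), now))    # 3 hours old
print(should_archive(datetime(2024, 1, 2, 9, 0), now, diagnostic_cleared=True))
```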