Understanding Node States, Health, and Operations

 

Applies To: Microsoft HPC Pack 2008 R2, Microsoft HPC Pack 2012, Microsoft HPC Pack 2012 R2

Node State reflects a node’s deployment state, and whether or not an administrator wants the node to be available as a resource for cluster jobs. An administrator brings a node to the Online state to indicate that the node should accept jobs or client requests.

Node Health indicates whether or not there are any warnings or errors that the HPC services are aware of on that node. If the node has a node health value of Error, the node will not be able to accept jobs or client requests, even if the node state is Online.
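
The interaction between node state and node health can be sketched as a single predicate. The following Python model is illustrative only and is not part of HPC Pack: the enum and function names are hypothetical, and treating a Warning health value as non-blocking is an assumption, because this topic states only that Error prevents a node from accepting work.

```python
from enum import Enum


class NodeState(Enum):
    ONLINE = "Online"
    OFFLINE = "Offline"


class NodeHealth(Enum):
    OK = "OK"
    WARNING = "Warning"
    ERROR = "Error"


def can_accept_jobs(state: NodeState, health: NodeHealth) -> bool:
    """Return True if a node could accept jobs or client requests.

    Hypothetical helper: the node must be Online, and its health must
    not be Error. Assumption: Warning does not block new work.
    """
    return state is NodeState.ONLINE and health is not NodeHealth.ERROR
```

For example, `can_accept_jobs(NodeState.ONLINE, NodeHealth.ERROR)` evaluates to `False`, which mirrors the rule that an Online node with a health value of Error cannot accept jobs.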

During normal operations, nodes have a node health value of OK. The following list describes the normal node state values:

  • The head node has a node state value of Offline. If the head node also acts as a compute node or a WCF broker node, or if the head node is installed for high availability, its normal node state value is Online.

  • Compute nodes and Windows Communication Foundation (WCF) broker nodes have a node state value of Online.

  • Workstation nodes can have a node state value of Online or Offline, according to the availability policy.

  • Windows Azure nodes that are defined but not deployed in Windows Azure have a normal node state value of Not-Deployed. Windows Azure nodes that are deployed have a normal node state value of Online.

Part of the process of monitoring and maintaining cluster health is finding deviations from the normal node state and health, and monitoring the state of cluster operations.
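
The normal values listed above can be turned into a simple deviation check. The following Python sketch is a hypothetical model, not an HPC Pack API: the `Node` record, its field names, and the role labels are assumptions made for illustration, and the node data would have to come from whatever inventory export you already use.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    role: str                        # "head", "compute", "broker", "workstation", or "azure"
    state: str                       # node state as shown in the node list view
    health: str                      # node health as shown in the node list view
    extra_roles: bool = False        # head node also acts as a compute or WCF broker node
    high_availability: bool = False  # head node is installed for high availability
    deployed: bool = True            # Windows Azure nodes only


def normal_states(node: Node) -> set:
    """Return the set of node states considered normal for this node."""
    if node.role == "head":
        if node.extra_roles or node.high_availability:
            return {"Online"}
        return {"Offline"}
    if node.role in ("compute", "broker"):
        return {"Online"}
    if node.role == "workstation":
        # Either value is normal, depending on the availability policy.
        return {"Online", "Offline"}
    if node.role == "azure":
        return {"Online"} if node.deployed else {"Not-Deployed"}
    raise ValueError(f"unknown role: {node.role}")


def find_deviations(nodes):
    """Return the names of nodes whose state or health is not normal."""
    return [
        n.name
        for n in nodes
        if n.health != "OK" or n.state not in normal_states(n)
    ]
```

A compute node that is Offline, or any node whose health is not OK, is reported; a head node that is Offline without additional roles is not.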

The sections in this topic describe the values for:

  • Node states

  • Node Health

  • Operation states

Node states

Node states reflect a node’s deployment state, and whether or not an administrator wants the node to be available as a resource for cluster jobs.

When the head node first detects an on-premises node on the network, the node appears in the Unknown state. When an administrator adds a node to the cluster by assigning a node template, the node moves to the Provisioning state. When the node has successfully joined the cluster, it moves to the Offline state.

When an administrator adds Windows Azure nodes to the cluster, they appear in the Not-Deployed state. When the Windows Azure nodes are started (which means that the instances are deployed in Windows Azure), the nodes move to the Provisioning state. After provisioning completes successfully, a Windows Azure node that is manually started goes to the Offline state, and a Windows Azure node that is started automatically goes to the Online state.
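
The two join sequences described above can be written down as a small transition table. This is a minimal Python sketch under stated assumptions: the node kinds (`"on_premises"`, `"azure"`) and the event names are hypothetical labels chosen for this example and do not correspond to HPC Pack commands.

```python
# (node kind, current state, event) -> next state.
# Event names are hypothetical labels, not HPC Pack actions.
TRANSITIONS = {
    ("on_premises", "Unknown", "assign_node_template"): "Provisioning",
    ("on_premises", "Provisioning", "provisioning_succeeded"): "Offline",
    ("azure", "Not-Deployed", "start"): "Provisioning",
}


def next_state(kind, state, event, started_automatically=False):
    """Return the state a node moves to when `event` occurs.

    A Windows Azure node that finishes provisioning goes Online if it was
    started automatically, and Offline if it was started manually.
    """
    if (kind, state, event) == ("azure", "Provisioning", "provisioning_succeeded"):
        return "Online" if started_automatically else "Offline"
    try:
        return TRANSITIONS[(kind, state, event)]
    except KeyError:
        raise ValueError(f"no transition for {kind} node in {state} on {event}")
```

Keeping the only policy-dependent branch (the end of Windows Azure provisioning) outside the table makes the difference between manual and automatic starts explicit.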

After an administrator adds workstation nodes or unmanaged server nodes to the cluster and assigns a node template, the nodes can be brought online to run cluster jobs, and then taken offline to resume their normal workloads. Nodes that are configured in the node template to be brought online and offline manually are initially offline. Nodes that are configured to follow a weekly availability policy begin to follow that policy, and are brought online automatically during the scheduled intervals.

An administrator brings a node Online or takes a node Offline to indicate whether or not the node should accept and run cluster jobs. Windows Azure nodes and workstation nodes can also be brought Online or Offline according to a weekly availability policy. The HPC Job Scheduler Service will only try to start new jobs on nodes that are in the Online state. To make a node unavailable for new jobs, administrators can take the node Offline. Nodes must be in the Offline state to run some management actions, such as Reimage or Maintain.

You can use the node list view to display the state of each node and filter compute nodes by node state.

The following table describes node state values:

Node State

Description

Online

This state indicates that the node should accept and run cluster jobs. For WCF Broker Nodes, this state indicates that they should be available to manage SOA sessions. The HPC Job Scheduler Service will only try to allocate work to nodes that are in the Online state.

A node must be in the Online node state and healthy to run jobs (or manage sessions). If the node health is Error, jobs will not be able to start on that node.

Nodes can be brought Online or Offline by the cluster administrator. Windows Azure nodes, workstation nodes, and unmanaged server nodes can also be brought Online or Offline according to a weekly availability policy.

Offline

This state indicates that the node should not be used to run cluster jobs. For WCF Broker Nodes, this state indicates that they should not be used to manage SOA sessions. This state allows a cluster administrator to run scripts, install software, and perform other tasks on the node. This is the default state of a node after a cluster administrator has approved the node for inclusion in the cluster.

This is the normal state for a head node (unless it is installed for high availability). You can bring a head node Online if you want it to perform additional node roles, such as Compute Node or WCF Broker Node. For more information, see Understanding Node Roles in Microsoft HPC Pack.

Nodes can be brought Online or Offline by the cluster administrator. Windows Azure nodes, workstation nodes, and unmanaged server nodes can also be brought Online or Offline according to a weekly availability policy.

If a node is taken offline while running jobs, it will first move through the Draining state. If an administrator chooses to force the node offline immediately, any running tasks will be canceled and requeued within their job.

Unknown

This state indicates that the node is not part of the cluster, or that a provisioning operation has failed on that node.

To join a node to the cluster, apply the Assign Node Template action to the node.

In a high availability cluster, after setup is run on the first head node, the second head node will be in the Unknown state until setup is run on that node. After setup, the second head node moves to the Online state.

Provisioning

On-premises nodes

This state indicates that the node is being configured as a cluster node. The Assign Node Template, Reimage, and Maintain actions also put a node into the provisioning state. After provisioning is complete, the node goes to the Offline state.

Windows Azure nodes

This state indicates that the node instance is being deployed in Windows Azure. The Start action or an automatic availability policy can put a Windows Azure node into the Provisioning state. After provisioning completes successfully, a Windows Azure node that is manually started goes to the Offline state, and a Windows Azure node that is started automatically goes to the Online state.

Starting

This state indicates that the node is transitioning from the Offline state to the Online state.

Note

The Start action does not put nodes in the Starting state. The Start action applies only to Windows Azure nodes, and is used to deploy node instances in Windows Azure. When the Start action is applied, nodes go into the Provisioning state.

Draining

This state indicates that the node has been taken offline and is transitioning to the Offline state. The node completes currently running jobs before going to the Offline state. Draining nodes do not accept new jobs.

Removing

This state indicates that information about the node is being removed from the HPC Node Management Services database. The Delete action puts a node into this state. Nothing is changed on the deleted node itself.

If the node tries to rejoin the cluster, a new entry will be created for that node in the database, and the node will appear in the Unknown state.

Rejected

This state indicates that the node was rejected by a cluster administrator.

Not-Deployed

This state only applies to Windows Azure nodes.

This state indicates that the Windows Azure node has been defined and added to the cluster, but the node has not been started and provisioned in Windows Azure (the node instance has not been created in Windows Azure). Windows Azure nodes are deployed according to the availability policy that is defined in the node template: manually (with the Start action), or automatically based on a weekly schedule.

Windows Azure nodes in the Not-Deployed state do not incur charges in Windows Azure.

Stopping

This state only applies to Windows Azure nodes.

This state indicates that the Windows Azure node instance is being removed from Windows Azure. Windows Azure nodes are stopped according to the availability policy that is defined in the node template: manually (with the Stop action), or automatically based on a weekly schedule.

When the stop operations are complete (the node instance is removed from Windows Azure), the node goes to the Not-Deployed state.
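
The Offline and Draining entries in the table above describe what happens to running work when a node is taken offline. The following Python sketch models that behavior in a simplified form; the function name, its arguments, and the representation of tasks as a list of identifiers are hypothetical and are not part of HPC Pack.

```python
def take_offline(running_tasks, force=False):
    """Model the Take Offline action on a node.

    Returns a tuple (next_state, requeued_tasks):
    - With no running tasks, the node goes directly to Offline.
    - With running tasks and force=False, the node enters Draining and
      finishes its current work before it goes Offline.
    - With force=True, running tasks are canceled and requeued within
      their job, and the node goes Offline immediately.
    """
    if not running_tasks:
        return "Offline", []
    if force:
        return "Offline", list(running_tasks)
    return "Draining", []


def accepts_new_jobs(state):
    """Only Online nodes are offered new work; Draining nodes are not."""
    return state == "Online"
```

For example, `take_offline(["task-1", "task-2"], force=True)` returns `("Offline", ["task-1", "task-2"])`, whereas the same call without `force` returns `("Draining", [])`.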

Node Health

Node Health indicates whether or not there are any warnings or errors that the HPC services are aware of on that node.

You can use the node list view to display the health of each compute node and filter nodes by node health. If the node health is Error or Warning, review the information on the Node Health tab for additional details. You can view the Node Health tab in the Details Pane (in list view), or by double-clicking a node.

The following table describes node health values:

Node Health

Description

OK

The HPC services are not aware of any problems with the node.

Warning

This value can indicate the following:

  • A cluster administrator ran diagnostic tests on the node, and one or more tests returned a result of Failure or Failed to Run. An administrator can manually clear the diagnostic alerts (see Resolve and Clear Diagnostic Alerts).

  • One or more node operations are in the Failed, Reverted, or Canceled state. Read the Operations Log to investigate the issue.

Review the information in the Node Health tab to begin investigating the issue.

Error

This value can indicate the following:

  • The node is not reachable, as determined by the Heartbeat options.

  • Provisioning failed.

  • The node was rejected by a cluster administrator. (You can assign a node template if you decide to join the node to the cluster.)

Review the information in the Node Health tab to begin investigating the issue.

Transitional

This value indicates that the node is performing an operation that a cluster administrator initiated, such as:

  • Assign Node Template, Reimage, or Maintain (in which case the Node State is Provisioning).

  • Bring Online (in which case the Node State is Starting).

  • Take Offline (in which case the Node State is Draining).

  • Start for Windows Azure nodes (in which case the Node State is Provisioning).

View the Node Health tab for additional information or to cancel the operation.

Unapproved

On-premises nodes

The node has been detected by the head node, but it is not part of the cluster. Assign a node template to join the node to the cluster. See also Adding Nodes to a Cluster.

Windows Azure nodes

The node has been added to the cluster, but the node has not been started and provisioned in Windows Azure (the node instance does not exist in Windows Azure).
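
The Transitional entry in the table above pairs each administrator action with the node state that is shown while the operation runs. That pairing can be captured as a lookup table, as in this Python sketch. The dictionary and function names are hypothetical; only the action names and state names come from this topic.

```python
# Action -> node state shown while node health is Transitional.
TRANSITIONAL_STATE_BY_ACTION = {
    "Assign Node Template": "Provisioning",
    "Reimage": "Provisioning",
    "Maintain": "Provisioning",
    "Bring Online": "Starting",
    "Take Offline": "Draining",
    "Start": "Provisioning",  # applies only to Windows Azure nodes
}


def state_during(action):
    """Return the node state expected while `action` is in progress."""
    try:
        return TRANSITIONAL_STATE_BY_ACTION[action]
    except KeyError:
        raise ValueError(f"action not covered by this sketch: {action}")
```

Note that `state_during("Start")` yields Provisioning rather than Starting, which matches the note in the node state table.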

Operation states

For information about how to view the operations log, see Read the Operations Log.

The following table describes the operation state values:

Operation State

Description

Archived

The operation is more than 24 hours old or the diagnostic test has been cleared. When an operation is archived, it is removed from other status reports.

Committed

The operation completed successfully.

Executing

The operation is in progress.

Failed

The operation failed to execute.

Reverting

The operation is being reverted. When clean-up of the operation is complete, the operation will move to the Reverted state.

Failed to Revert

Clean-up of the operation was not successful.

Reverted

The operation reverted after failure or cancellation.
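
Two rules from this topic relate operation states to what an administrator sees: the archiving rule in the table above, and the Warning rule in the node health table. The following Python sketch expresses both. It is a simplified model with hypothetical names, and it assumes that the age of an operation is measured from a single timestamp, which this topic does not specify.

```python
from datetime import datetime, timedelta

ARCHIVE_AGE = timedelta(hours=24)

# Operation states that raise a Warning on the node, per the node health table.
WARNING_OPERATION_STATES = {"Failed", "Reverted", "Canceled"}


def is_archived(operation_time, now, diagnostic_cleared=False):
    """Return True if an operation would be archived.

    An operation is archived when it is more than 24 hours old, or when
    its diagnostic test has been cleared.
    Assumption: age is measured from `operation_time`.
    """
    return diagnostic_cleared or (now - operation_time) > ARCHIVE_AGE


def raises_warning(operation_states):
    """Return True if any node operation is Failed, Reverted, or Canceled."""
    return any(s in WARNING_OPERATION_STATES for s in operation_states)
```

Passing `now` explicitly, instead of calling `datetime.now()` inside the function, keeps the check deterministic and easy to test.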

Additional references