Node Health is Flagged as Unreachable

07/02/2015

Updated: May 2011

Applies To: Windows HPC Server 2008 R2, Windows Server 2008

A compute node’s health status is flagged as Unreachable. This topic provides guidelines for how to troubleshoot unreachable nodes in a Windows HPC Server 2008 R2 or Windows HPC Server 2008 cluster.

Note
A node that has a node health status of Unreachable may still have a node state of Online. The Online node state does not indicate that the node is healthy. The Online state indicates that the HPC Job Scheduler service can try to run jobs on that node. While troubleshooting an unreachable node, take the node Offline so that the HPC Job Scheduler service will not try to start jobs on the node.

Cause

The HPC Job Scheduler Service flags a node as Unreachable when the node has missed the number of heartbeats specified by the Inactivity Count setting. The HPC Job Scheduler Service sends regular health probes to the HPC Node Manager Service; a compute node misses a heartbeat if it does not reply to the health probe. This can happen for various reasons, including:

Problems with network connectivity
The HPC Node Manager Service is not running on the compute node
Authentication failure between the head node and the compute node
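
The heartbeat rule described above can be pictured with a small Python sketch. This is only an illustration of the counting logic, not the scheduler's actual implementation: the class and field names are hypothetical, and it assumes that missed heartbeats are counted consecutively and that any reply resets the count.

```python
from dataclasses import dataclass


@dataclass
class NodeHealth:
    """Tracks missed health probes for one compute node (illustrative only)."""
    inactivity_count: int   # missed heartbeats allowed before the node is flagged
    missed: int = 0         # consecutive probes that received no reply

    def record_probe(self, replied: bool) -> str:
        """Update the counter for one health probe and return the node health."""
        if replied:
            self.missed = 0  # assumption: any reply resets the counter
        else:
            self.missed += 1
        return "Unreachable" if self.missed >= self.inactivity_count else "OK"


# Hypothetical Inactivity Count of 3: the node is flagged on the third miss.
node = NodeHealth(inactivity_count=3)
statuses = [node.record_probe(r) for r in (True, False, False, False, True)]
print(statuses)
```

In this sketch the node returns to OK as soon as it answers a probe again, which matches the Verification step at the end of this topic only under the assumptions stated above.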

Resolution

The following steps can help identify and resolve the problems with an unreachable node:

Verify network connectivity between the head node and the unreachable node
Verify that HPC services are running on the unreachable node
Check the logs for authentication errors
Verify connectivity to the Active Directory domain controller

You can try running the following diagnostic tests to verify connectivity and running services on the unreachable node: Internode Connectivity, Domain Connectivity, and All Services Running. For more information, see Run Diagnostic Tests. However, it is not always possible to run the built-in diagnostic tests on unreachable nodes. If a diagnostic fails to run, or if the node becomes stuck in an Ongoing Operation after the diagnostic starts, cancel the diagnostic, and then use the procedures in this section to help identify and resolve the issue.

Verify network connectivity between the head node and the unreachable node

Try to ping the unreachable node from the head node to verify that the network connection between them is working.

To ping the unreachable node

In HPC Cluster Manager, in Node Management, in the Navigation Pane, click Nodes.
Look up the IP address of the unreachable node: In List view, select the unreachable node, then in the Detail Pane, select the Network tab. Use the IP address that is bound to the private network. If your cluster is only on an enterprise network (Topology 5), use the enterprise network IP address.
In List view, right-click the head node, then click Run Command.
In the Run a Command dialog box, in Command line, type ping <ip_address>, where ip_address is the IP address of the unreachable node.
Click Run, then wait for the command output to appear.
If the ping was successful, the Command output will be similar to the following:

Reply from IP_address: bytes=32 time<1ms TTL=128

Reply from IP_address: bytes=32 time<1ms TTL=128

Reply from IP_address: bytes=32 time<1ms TTL=128

Reply from IP_address: bytes=32 time<1ms TTL=128

If the ping is successful, then proceed to Verify that HPC services are running on the unreachable node.

If you cannot successfully connect to the node by IP address, this indicates a possible issue with network connectivity, firewall configuration, or Internet Protocol security (IPsec) configuration.
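
If you save the command output, a short script can count the successful replies for you. The following Python sketch assumes the output uses the reply format shown above; the function name and the sample IP addresses are hypothetical. It deliberately ignores lines such as "Reply from ...: Destination host unreachable", which start with "Reply from" but do not indicate a successful ping.

```python
import re

# Matches successful reply lines in the format shown in this topic, for example:
#   Reply from 10.0.0.5: bytes=32 time<1ms TTL=128
REPLY_LINE = re.compile(r"^Reply from \S+: bytes=\d+ time[<=]\d+ms TTL=\d+$")


def count_replies(ping_output: str) -> int:
    """Return the number of successful reply lines in saved ping output."""
    return sum(
        1 for line in ping_output.splitlines() if REPLY_LINE.match(line.strip())
    )


# Hypothetical sample output; the addresses are placeholders.
sample = """\
Pinging 10.0.0.5 with 32 bytes of data:
Reply from 10.0.0.5: bytes=32 time<1ms TTL=128
Reply from 10.0.0.5: bytes=32 time=2ms TTL=128
Request timed out.
Reply from 10.0.0.9: Destination host unreachable.
"""

print(count_replies(sample))
```

A count of zero suggests the connectivity, firewall, or IPsec issues described above.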

Verify that HPC services are running on the unreachable node

Log on to the unreachable node and open the Services snap-in to verify that the HPC Node Manager Service and the HPC Management Service are running. Start any service that is not running. If the services are already running, restart them.

To verify and restart running services

Log on to the compute node as a user with administrative permissions.
Open the Services snap-in: Click Start, point to Administrative Tools, and then click Services.
Verify that the Status of the HPC Node Manager Service and the HPC Management Service is Started.

If a service is not started, right-click the service, then click Start. If a service is started, right-click the service, then click Restart.

In HPC Cluster Manager, refresh the node list view (press F5). If the node is still flagged as Unreachable, then proceed to Check the logs for authentication errors.

Note
You can also verify running services by opening an elevated Command Prompt window on the compute node (run as administrator). Type net start for a list of running services. If the HPC services are not listed, type net start hpcmanagement to start the HPC Management Service, and type net start hpcnodemanager to start the HPC Node Manager Service. To restart a running service, first type net stop <servicename>, then type net start <servicename>, where <servicename> is the name of the service that you want to restart.
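
As a rough illustration of the check described in this note, the following Python sketch takes saved net start output and reports which commands you would still need to run. It assumes that net start lists one service display name per line and that the display names match the service names used in this topic. The function name and sample output are hypothetical.

```python
from typing import List

# Display names and service names as given in this topic.
HPC_SERVICES = {
    "HPC Management Service": "hpcmanagement",
    "HPC Node Manager Service": "hpcnodemanager",
}


def commands_to_start_missing(net_start_output: str) -> List[str]:
    """Return the 'net start' commands for HPC services that are not listed."""
    # Assumption: each running service appears on its own (indented) line.
    running = {line.strip().lower() for line in net_start_output.splitlines()}
    return [
        f"net start {name}"
        for display, name in HPC_SERVICES.items()
        if display.lower() not in running
    ]


# Hypothetical saved output in which only one HPC service is running.
sample_output = """\
These Windows services are started:

   DNS Client
   HPC Management Service
   Windows Firewall
"""

print(commands_to_start_missing(sample_output))
```

The script only suggests commands. Run them yourself in the elevated Command Prompt window.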

Check the logs for authentication errors

Review the operations log or the provisioning log for the node. If you see error messages similar to Access is denied or Access to the change is denied, this means that the compute node did not authenticate with the head node successfully. For example, a failed diagnostic run could include a message similar to the following:

The Management service encountered an error while performing a change on this node. Access is denied to user ‘NT AUTHORITY\ANONYMOUS LOGON’.

To view the operations log

In HPC Cluster Manager, in Node Management, in the Navigation Pane, click Nodes.
In List view, select the unreachable node, then in the Actions pane, under Pivot to, click Operations for the Nodes.
Select a failed operation in the list of operations, then review messages for that operation in the Detail Pane.

Authentication errors can occur if the compute node cannot contact the domain controller (see Verify connectivity to the Active Directory domain controller). Authentication errors can also occur if the Security ID (SID) associated with the node in Active Directory Domain Services does not match the SID that is stored for that node on the cluster head node. If the domain account for the compute node has been deleted and then recreated, the SID stored on the head node will not be updated.

Verify connectivity to the Active Directory domain controller

If the operations log or the provisioning log indicates authentication errors, verify that the compute node can contact the domain controller.

To verify connectivity to the domain controller
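
If you have copied log messages out of the Detail Pane into a list of strings, a simple filter can highlight the ones that point to an authentication problem. The following Python sketch is an assumption about how you might do that and is not part of HPC Cluster Manager. The function name and sample messages are hypothetical, and the matched phrases are the two quoted in this section.

```python
from typing import Iterable, List

# Phrases that this topic associates with failed authentication.
AUTH_ERROR_MARKERS = ("access is denied", "access to the change is denied")


def find_auth_errors(messages: Iterable[str]) -> List[str]:
    """Return the messages that contain one of the authentication markers."""
    return [
        message
        for message in messages
        if any(marker in message.lower() for marker in AUTH_ERROR_MARKERS)
    ]


# Hypothetical messages copied from the operations log.
sample_messages = [
    "Diagnostic test started on node.",
    "The Management service encountered an error while performing a change "
    "on this node. Access is denied to user 'NT AUTHORITY\\ANONYMOUS LOGON'.",
    "Operation canceled by user.",
]

for message in find_auth_errors(sample_messages):
    print(message)
```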

Log on to the compute node as a user with administrative permissions.
Open an elevated Command Prompt window: click Start, point to All Programs, click Accessories, right-click Command Prompt, and then click Run as administrator.
Type ping <server_FQDN>, where <server_FQDN> is the fully qualified domain name (FQDN) of the domain controller (for example, server1.contoso.com), and then press ENTER.
If the ping was successful, you will receive a reply similar to the following:

Reply from IP_address: bytes=32 time<1ms TTL=128

Reply from IP_address: bytes=32 time<1ms TTL=128

Reply from IP_address: bytes=32 time<1ms TTL=128

Reply from IP_address: bytes=32 time<1ms TTL=128

Type ping <ip_address>, where <ip_address> is the IP address of the domain controller, and then press ENTER.

If you can successfully connect to the domain controller by IP address but not by FQDN, this indicates a possible issue with Domain Name System (DNS) host name resolution. If you cannot successfully connect to the domain controller by IP address, this indicates a possible issue with network connectivity, firewall configuration, or Internet Protocol security (IPsec) configuration.
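
The interpretation of the two ping results can be summarized as a small decision function. The following Python sketch only restates the rules in the preceding paragraph. The function name and return strings are hypothetical.

```python
def diagnose_dc_connectivity(fqdn_ping_ok: bool, ip_ping_ok: bool) -> str:
    """Map the two ping results to the likely problem area named in this topic."""
    if not ip_ping_ok:
        # Cannot reach the domain controller by IP address at all.
        return "network connectivity, firewall, or IPsec configuration"
    if not fqdn_ping_ok:
        # Reachable by IP address but not by name.
        return "DNS host name resolution"
    return "no connectivity problem detected"


# Reachable by IP address but not by FQDN points to name resolution.
print(diagnose_dc_connectivity(fqdn_ping_ok=False, ip_ping_ok=True))
```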

Verification

In HPC Cluster Manager, refresh the node list view (press F5). The node health should be flagged as OK.