Monitoring Nodes

Applies To: Windows HPC Server 2008

A key step in monitoring and maintaining cluster health is to identify any deviation from the normal operational state or performance. HPC Cluster Manager enables you to view cluster and node status at a glance, identify problem nodes, and drill down into node details for further investigation.

Note

HPC Cluster Manager provides several charts and reports to monitor and analyze cluster resource usage and job and node statistics. For more information, see Understanding Charts and Reports.

  • View cluster status at a glance

  • Drill down into individual node details

  • Monitor node operations

  • Correlate the monitoring information between nodes, jobs, operations, and diagnostics

View cluster status at a glance

In Node Management, you can monitor your cluster at a glance by using the node List view or the node Heat Map view.

Drill down into individual node details

The List and Heat Map views provide a starting point for identifying problem areas. Double-click a compute node to see detailed information such as hardware, operating system properties, and current performance metrics. You can also select one or more nodes, then drill down into the node details to investigate performance.
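The triage described above (scan the overview, then single out nodes worth drilling into) can be sketched in a few lines of Python. This is only an illustration of the idea, not the HPC Cluster Manager API: the `NodeStatus` record, its field names, the state and health labels, and the `cpu_limit` threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class NodeStatus:
    """Hypothetical snapshot of one node, as an overview view might show it."""
    name: str
    state: str          # illustrative labels such as "Online" or "Offline"
    health: str         # illustrative labels such as "OK", "Warning", "Error"
    cpu_percent: float  # most recent CPU usage sample


def find_problem_nodes(nodes, cpu_limit=95.0):
    """Map each node that merits a closer look to the reasons it was flagged."""
    problems = {}
    for node in nodes:
        reasons = []
        if node.state != "Online":
            reasons.append(f"state is {node.state}")
        if node.health != "OK":
            reasons.append(f"health is {node.health}")
        if node.cpu_percent > cpu_limit:
            reasons.append(f"CPU at {node.cpu_percent:.0f}%")
        if reasons:
            problems[node.name] = reasons
    return problems


# Made-up sample data for the sketch.
sample = [
    NodeStatus("NODE01", "Online", "OK", 42.0),
    NodeStatus("NODE02", "Offline", "OK", 0.0),
    NodeStatus("NODE03", "Online", "Warning", 99.0),
]

problems = find_problem_nodes(sample)
for name, reasons in problems.items():
    print(name, "->", "; ".join(reasons))
```

Nodes that appear in the result are the ones you would open in the node details view for further investigation.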

Monitor node operations

Tracking recent and ongoing cluster operations is another monitoring task that is critical to administering a cluster. Windows HPC Server 2008 archives recent operations and lets you view the progress of ongoing operations in real time.

Correlate the monitoring information between nodes, jobs, operations, and diagnostics

In HPC Job Manager, you can use the Pivot To actions to correlate the monitoring information between nodes, jobs, operations, and diagnostics. For example, you can select one or more nodes in the views pane, and then pivot to the Jobs for the Selected Nodes. This takes you to a job list view that is filtered by the nodes that you selected.

The supported pivot paths are:

  • Nodes: pivot to jobs, test results, and operations.

  • Jobs: pivot to nodes.

  • Test results: pivot to failed nodes and operations.
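As a rough model of the pivot paths listed above, the following Python sketch encodes them as a lookup table and shows what "jobs for the selected nodes" means as a filter. The dictionary keys, the job record layout (`id`, `nodes`), and the function names are hypothetical and chosen for illustration only; they do not correspond to any HPC API.

```python
# Supported pivot paths, transcribed from the list above.
PIVOT_PATHS = {
    "nodes": {"jobs", "test results", "operations"},
    "jobs": {"nodes"},
    "test results": {"failed nodes", "operations"},
}


def can_pivot(source, target):
    """Return True if the list above allows pivoting from source to target."""
    return target in PIVOT_PATHS.get(source, set())


def jobs_for_selected_nodes(jobs, selected_nodes):
    """Keep only the jobs that ran on at least one of the selected nodes."""
    selected = set(selected_nodes)
    return [job for job in jobs if selected & set(job["nodes"])]


# Made-up job records for the sketch.
jobs = [
    {"id": 101, "nodes": ["NODE01", "NODE02"]},
    {"id": 102, "nodes": ["NODE03"]},
    {"id": 103, "nodes": ["NODE02", "NODE04"]},
]

filtered = jobs_for_selected_nodes(jobs, ["NODE02"])
print([job["id"] for job in filtered])
print(can_pivot("nodes", "jobs"), can_pivot("jobs", "operations"))
```

The filter uses a set intersection so that selecting several nodes returns every job that touched any of them, which matches the behavior described for the filtered job list view.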
