Set and Clear Excluded Nodes for Jobs
If you notice that tasks consistently fail on a particular node, you can exclude that node from one or more jobs by adding it to the Excluded Nodes job property. When you specify nodes in the Excluded Nodes property:
Tasks in the job that are running on a node that has been added to Excluded Nodes are canceled and marked as Failed (with the exception of Node Release tasks).
Node Release tasks run on the excluded node before the node is released.
No tasks in the job are started on nodes that are listed in Excluded Nodes.
If additions to the Excluded Nodes list cause the job to drop below its minimum resource requirements, the job is canceled and requeued.
For any active job, you can add or remove nodes in the Excluded Nodes job property, or clear the list. The following commands modify and view the Excluded Nodes list by using HPC PowerShell or a command prompt.
In HPC PowerShell, use the Set-HpcJob cmdlet, for example:
Set-HpcJob -JobId <yourJobID> -AddExcludedNodes <nodeName>,<nodeName>
Set-HpcJob -JobId <yourJobID> -RemoveExcludedNodes <nodeName>,<nodeName>
Set-HpcJob -JobId <yourJobID> -ClearExcludedNodes
(Get-HpcJob -JobId <yourJobID>).ExcludedNodes
Or, to view all job properties:
Get-HpcJob -JobId <yourJobID> | fl
At a command prompt, use the job modify command, for example:
job modify <yourJobID> /addExcludedNodes:<nodeName>,<nodeName>
job modify <yourJobID> /removeExcludedNodes:<nodeName>,<nodeName>
job modify <yourJobID> /clearExcludedNodes
job view <yourJobID> /detailed | find /i "excludednodes"
Or, to view all job properties:
job view <yourJobID> /detailed
Note
For SOA jobs, the broker node automatically updates and maintains the list of excluded nodes according to the EndPointNotFoundRetryPeriod setting (in the service configuration file). This setting specifies how long the service host should retry loading the service and how long the broker should wait for a connection. If this time elapses, the broker adds the node (service host) to the Excluded Nodes list. The service configuration also includes the maxExcludedNodes setting that specifies how many nodes can be excluded before the session fails.
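To check which values a given service uses for these two settings, you can search its configuration file for them. The following is a minimal sketch, not part of HPC Pack: Get-ExcludedNodeSetting is a hypothetical helper name, and it assumes that both settings are stored as XML attributes. It makes no assumption about which element holds them, and it matches the names case-insensitively.

```powershell
# Hypothetical helper (not an HPC Pack cmdlet): reports the two settings named
# in the note wherever they appear as attributes in a service configuration file.
# Assumption: the settings are XML attributes; the element layout is not assumed.
function Get-ExcludedNodeSetting {
    param(
        [Parameter(Mandatory = $true)]
        [xml]$Config
    )
    $wanted = 'EndPointNotFoundRetryPeriod', 'maxExcludedNodes'
    # '//@*' selects every attribute in the document.
    foreach ($attr in $Config.SelectNodes('//@*')) {
        # -contains compares strings case-insensitively.
        if ($wanted -contains $attr.LocalName) {
            [pscustomobject]@{
                Element = $attr.OwnerElement.LocalName
                Setting = $attr.LocalName
                Value   = $attr.Value
            }
        }
    }
}

# Usage sketch (the path is a placeholder):
# Get-ExcludedNodeSetting -Config (Get-Content '<path to service configuration file>' -Raw)
```

If the helper returns nothing, the settings are either absent from the file or not stored as attributes, so inspect the file directly.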
Monitoring excluded nodes on the cluster
To see all excluded nodes on a cluster, use the Get-HpcJob PowerShell cmdlet. The following example shows how to list all of the excluded nodes for jobs that were submitted today. The script also lists the job template that was used for the job that excluded the node. In the following cmdlet, <today’s date> is specified in a date format such as mm/dd/yyyy:
Get-HpcJob -BeginSubmitDate <today’s date> | select ExcludedNodes, JobTemplate | sort
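To see at a glance which nodes are excluded most often, the per-job listing can be condensed into a per-node count. The following is a minimal sketch, not part of HPC Pack: Get-ExcludedNodeSummary is a hypothetical helper name, and it assumes that each job object exposes ExcludedNodes either as a comma-separated string or as a collection of node names.

```powershell
# Hypothetical helper (not an HPC Pack cmdlet): counts how many jobs exclude each node.
# Assumption: ExcludedNodes is a comma-separated string or a collection of names.
function Get-ExcludedNodeSummary {
    param(
        [Parameter(ValueFromPipeline = $true)]
        $Job
    )
    begin { $counts = @{} }
    process {
        # Normalize ExcludedNodes to a flat array of trimmed, non-empty names.
        $names = @(
            @($Job.ExcludedNodes) |
                ForEach-Object { "$_" -split ',' } |
                ForEach-Object { $_.Trim() } |
                Where-Object { $_ }
        )
        # Count each node once per job, even if the job lists it twice.
        foreach ($name in ($names | Sort-Object -Unique)) {
            $counts[$name] = 1 + [int]$counts[$name]
        }
    }
    end {
        # Most frequently excluded nodes first, ties ordered by node name.
        $counts.GetEnumerator() |
            Sort-Object -Property @{ Expression = 'Value'; Descending = $true }, 'Key' |
            ForEach-Object { [pscustomobject]@{ Node = $_.Key; JobCount = $_.Value } }
    }
}

# Usage sketch on a system with HPC PowerShell loaded:
# Get-HpcJob -BeginSubmitDate (Get-Date).Date | Get-ExcludedNodeSummary
```

A node that appears in the exclusion lists of many unrelated jobs is a likely candidate for investigation.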
If the cluster administrator detects and resolves the issue on one or more nodes, the administrator can remove the fixed node from any node exclusion list in which it appears. The following cmdlet gets all active jobs and removes the fixed nodes from the node exclusion lists (this has no effect on jobs that do not list the specified nodes):
Get-HpcJob | Set-HpcJob -RemoveExcludedNodes <fixedNodeName>,<fixedNodeName>
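If you prefer to review which jobs are affected before you change them, you can filter the active jobs first. The following is a minimal sketch, not part of HPC Pack: Select-JobExcludingNode is a hypothetical helper name, NODE07 is a placeholder, and the helper assumes that ExcludedNodes is either a comma-separated string or a collection of node names.

```powershell
# Hypothetical helper (not an HPC Pack cmdlet): passes through only the jobs
# whose ExcludedNodes list contains at least one of the given node names.
function Select-JobExcludingNode {
    param(
        [Parameter(Mandatory = $true, Position = 0)]
        [string[]]$NodeName,

        [Parameter(ValueFromPipeline = $true)]
        $Job
    )
    process {
        # Normalize ExcludedNodes to a flat array of trimmed, non-empty names.
        $excluded = @(
            @($Job.ExcludedNodes) |
                ForEach-Object { "$_" -split ',' } |
                ForEach-Object { $_.Trim() } |
                Where-Object { $_ }
        )
        foreach ($name in $NodeName) {
            # -contains compares strings case-insensitively.
            if ($excluded -contains $name) {
                $Job    # emit the job once
                break   # no need to check the remaining names
            }
        }
    }
}

# Usage sketch on a system with HPC PowerShell loaded (NODE07 is a placeholder):
# Get-HpcJob | Select-JobExcludingNode NODE07                      # review first
# Get-HpcJob | Select-JobExcludingNode NODE07 | Set-HpcJob -RemoveExcludedNodes NODE07
```

Filtering is optional, because removing a node from a job that does not list it has no effect. Its value is that it shows which jobs will change.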