HPCv2RC1: Nodes imaged successfully, but show “Provisioning Failed”
I just installed the brand-new RC1 build on my head node. I provisioned all 23 nodes of my cluster at once, using the Default compute node template. While going through the “Create a node template” wizard in the To-Do list, I selected the option “Include a step in the template to download and install updates for my cluster using Microsoft Update or the Enterprise Windows Server Update Services (WSUS).”
I monitored the progress of the nodes by looking at the “Provisioning Log” tab of each node in the “Node Management” view of the HPC Cluster Manager console. Everything went smoothly through the operating system installation phase. All nodes downloaded and installed an operating system image. To my dismay, however, every node then failed the provisioning process at the same step: “Windows update failed to find updates. Exception from HRESULT: 0x8024402C -2145107924”.
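The two numbers in that log line are the same 32-bit error code written two ways: once in hexadecimal and once as a signed decimal. The short Python sketch below shows the conversion in both directions, which is handy when a log gives only one form and a search engine wants the other. The helper names are my own and are not part of any HPC tooling. As far as I know, the Windows Update error code reference lists 0x8024402C as WU_E_PT_WINHTTP_NAME_NOT_RESOLVED (a server name could not be resolved), which fits a connectivity problem.

```python
def hresult_to_signed(code: int) -> int:
    """Interpret a 32-bit HRESULT as a signed (two's complement) integer."""
    code &= 0xFFFFFFFF
    # If the top bit is set, the signed value is the unsigned value minus 2**32.
    return code - 0x100000000 if code & 0x80000000 else code


def hresult_to_hex(value: int) -> str:
    """Format a signed or unsigned HRESULT as an 8-digit hex string."""
    return f"0x{value & 0xFFFFFFFF:08X}"


print(hresult_to_signed(0x8024402C))  # -2145107924
print(hresult_to_hex(-2145107924))    # 0x8024402C
```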
For a host of reasons, not the least of which are external connectivity issues or misconfiguration of the head node, compute nodes may be unable to locate the WSUS server on the first try, and will retry. However, after three failures to locate WSUS, the entire provisioning task is marked as failed. In this case, I was left with all nodes in the “Unknown” state with Node Health set to “Provisioning Failed.”
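To make the retry behavior concrete, here is a minimal Python sketch of the policy as I observed it: try the update step, retry on failure, and mark the whole task as failed after the third failure. This is only an illustration and not the actual HPC implementation. The function name `run_update_step` and the `find_updates` callable are hypothetical stand-ins.

```python
from typing import Callable

MAX_ATTEMPTS = 3  # observed behavior: three failures fail the whole task


def run_update_step(find_updates: Callable[[], bool],
                    max_attempts: int = MAX_ATTEMPTS) -> str:
    """Call find_updates (a hypothetical stand-in) until it succeeds or attempts run out."""
    for _ in range(max_attempts):
        if find_updates():
            return "OK"
    # Every attempt failed, so the overall provisioning task is a failure,
    # even though the earlier operating system installation succeeded.
    return "Provisioning Failed"


# A WSUS server that can never be located fails the task.
print(run_update_step(lambda: False))  # Provisioning Failed
```

Note that the failure is reported for the task as a whole, which is why a node with a perfectly good operating system installation can still end up flagged as “Provisioning Failed.”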
At this stage, I knew of two obvious options:
1.) I could diagnose and fix whatever connectivity issues were affecting WSUS, then retry the existing provisioning process.
2.) I could modify the default compute node template to eliminate the WSUS task, reprovision all nodes, and then, once provisioning completed successfully, apply a second template that handled only WSUS.
However, there is a third option that eliminates the need to do a full reprovisioning. With this option, I was able to salvage the operating system installation on each node and bring all of them into a provisioned state. That option? Delete the compute node from the HPC Cluster Manager.
When you delete a compute node that otherwise has a healthy installation of the HPC node manager and related services, the compute node will attempt to reconnect to the head node. This process may take several minutes, but eventually the node will show up once again in the HPC Cluster Manager console, with a status of “Unknown” and a “Node Health” of “OK”. At this point the two options become:
1.) Diagnose and fix connectivity issues, then reapply the same provisioning template. This time, the node manager will be smart enough to recognize that the node is already installed and will skip over the installation steps, saving time and bandwidth.
2.) Modify the existing template to remove the WSUS step and/or create and apply a non-imaging node template.
I chose the second option and successfully provisioned my nodes. Next step: chase down the source of my connectivity issues…