HPC Pack 2019 Cluster Manager cannot connect to newly installed cluster

muellech 106 Reputation points
2024-10-17T08:46:17.5233333+00:00

I have been upgrading an on-premises HPC Pack 2016 cluster to 2019, which was basically a reinstall. The database is a completely new SQL Server instance and the scripts ran as expected. The installation went through without errors on three head nodes with built-in clustering. However, I cannot connect with the HPC Cluster Manager, presumably because the services are constantly crashing. Errors I see (on all nodes) in the application log include:

Application: HpcDiagnostics.exe
Framework Version: v4.0.30319
Description: The process was terminated due to an unhandled exception.
Exception Info: System.ComponentModel.Win32Exception
   at Microsoft.ComputeCluster.Management.Win32Helpers.HAUtils.SetGenericServiceRegistryCheckpoint(System.String, System.String)
   at Microsoft.Hpc.Diagnostics.Store.DiagnosticCrypto+<GetKeyAndSalt>d__23.MoveNext()
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task)
   at Microsoft.Hpc.Diagnostics.Store.DiagnosticCrypto+<InitDefault>d__18.MoveNext()
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task)
   at Microsoft.Hpc.Diagnostics.Store.DiagnosticsStore+<Init>d__27.MoveNext()
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task)
   at Microsoft.Hpc.Diagnostics.DiagnosticsSvc+<StartSvc>d__8.MoveNext()
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task)
   at Microsoft.Hpc.DiagnosticsWinService.DiagnosticsWinService+<<OnStart>b__2_1>d.MoveNext()
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task)
   at Microsoft.Hpc.HighAvailabilityModule.Algorithm.MembershipWithWitness+<>c__DisplayClass45_0.<RunAsync>b__0(System.Object)
   at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
   at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
   at System.Threading.QueueUserWorkItemCallback.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem()
   at System.Threading.ThreadPoolWorkQueue.Dispatch()

and

Application: HpcScheduler.exe
Framework Version: v4.0.30319
Description: The process was terminated due to an unhandled exception.
Exception Info: System.ComponentModel.Win32Exception
   at Microsoft.ComputeCluster.Management.Win32Helpers.HAUtils.SetGenericServiceRegistryCheckpoint(System.String, System.String)
   at Microsoft.Hpc.Scheduler.SchedulerCrypto+<InitKeyAndSalt>d__29.MoveNext()
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task)
   at Microsoft.Hpc.Scheduler.SchedulerCrypto+<InitDefault>d__24.MoveNext()
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task)
   at Microsoft.Hpc.Scheduler.Store.SchedulerStoreInternal..ctor(Boolean, Boolean, System.String, System.String, System.String, System.String, System.Func`2<System.String,System.String>, System.Func`2<System.String,System.String>)
   at Microsoft.Hpc.Scheduler.Store.SchedulerStoreInternal..ctor(Boolean)
   at Microsoft.Hpc.Scheduler.SchedulerSvc+<StartSvc>d__19.MoveNext()
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task)
   at Microsoft.Hpc.Scheduler.SchedulerService+<<OnStart>b__5_1>d.MoveNext()
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task)
   at Microsoft.Hpc.HighAvailabilityModule.Algorithm.MembershipWithWitness+<>c__DisplayClass45_0.<RunAsync>b__0(System.Object)
   at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
   at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
   at System.Threading.QueueUserWorkItemCallback.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem()
   at System.Threading.ThreadPoolWorkQueue.Dispatch()

and the SCM reports that

The HPC Job Scheduler Service service terminated unexpectedly. It has done this 2300 time(s). The following corrective action will be taken in 30000 milliseconds: Restart the service.

The HPC Diagnostics Service service terminated unexpectedly. It has done this 2203 time(s). The following corrective action will be taken in 30000 milliseconds: Restart the service.

The HPC Session Service service terminated unexpectedly. It has done this 1843 time(s). The following corrective action will be taken in 30000 milliseconds: Restart the service.

The HPC SDM Store Service service terminated unexpectedly. It has done this 1811 time(s). The following corrective action will be taken in 30000 milliseconds: Restart the service.

The HPC Monitoring Server Service service terminated unexpectedly. It has done this 1825 time(s). The following corrective action will be taken in 30000 milliseconds: Restart the service.

The HPC Reporting Service service terminated unexpectedly. It has done this 1839 time(s). The following corrective action will be taken in 30000 milliseconds: Restart the service.

The HPC Management Service service terminated unexpectedly. It has done this 1837 time(s). The following corrective action will be taken in 30000 milliseconds: Restart the service.
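
For a rough sense of how long the crash loop has been running, the restart counters can be multiplied by the 30000 ms restart delay. The Python sketch below does that for three of the messages quoted above. The names `SCM_LOG`, `PATTERN`, and `summarize` are made up for this illustration. The result is only a lower bound, assuming the counters were never reset, because it ignores the time each service ran before it crashed.

```python
import re

# Three of the seven SCM messages quoted above, copied verbatim.
SCM_LOG = """\
The HPC Job Scheduler Service service terminated unexpectedly. It has done this 2300 time(s). The following corrective action will be taken in 30000 milliseconds: Restart the service.
The HPC Diagnostics Service service terminated unexpectedly. It has done this 2203 time(s). The following corrective action will be taken in 30000 milliseconds: Restart the service.
The HPC SDM Store Service service terminated unexpectedly. It has done this 1811 time(s). The following corrective action will be taken in 30000 milliseconds: Restart the service.
"""

# Matches the fixed wording of the SCM message and captures the variable parts.
PATTERN = re.compile(
    r"The (?P<name>.+?) service terminated unexpectedly\. "
    r"It has done this (?P<count>\d+) time\(s\)\. "
    r"The following corrective action will be taken in (?P<delay>\d+) milliseconds"
)


def summarize(log):
    """Return (service, restart count, minimum hours in the loop), worst first."""
    rows = []
    for match in PATTERN.finditer(log):
        count = int(match["count"])
        delay_ms = int(match["delay"])
        # Lower bound: restart delays only, not the run time before each crash.
        hours = count * delay_ms / 1000 / 3600
        rows.append((match["name"], count, round(hours, 1)))
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    for name, count, hours in summarize(SCM_LOG):
        print(f"{name}: {count} restarts, at least {hours} h")
```

With the figures above, the Job Scheduler counter alone corresponds to at least about 19 hours of continuous restarts.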

I have already tried a complete rebuild of the cluster (including the DBs) to no avail. I think the issue might be related to TLS, because the call stacks contain crypto-related frames (for example SchedulerCrypto+<InitKeyAndSalt>).
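
To see what the two traces actually have in common, their frames can be compared starting from the innermost one. The sketch below is a minimal Python illustration using abbreviated copies of the two traces (first three frames only). The helper names `frames` and `shared_innermost` are hypothetical and not part of any HPC Pack tooling.

```python
# Abbreviated copies of the two event-log traces (innermost frames only).
DIAG_TRACE = (
    "Exception Info: System.ComponentModel.Win32Exception "
    "at Microsoft.ComputeCluster.Management.Win32Helpers.HAUtils."
    "SetGenericServiceRegistryCheckpoint(System.String, System.String) "
    "at Microsoft.Hpc.Diagnostics.Store.DiagnosticCrypto+<GetKeyAndSalt>d__23.MoveNext() "
    "at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()"
)
SCHED_TRACE = (
    "Exception Info: System.ComponentModel.Win32Exception "
    "at Microsoft.ComputeCluster.Management.Win32Helpers.HAUtils."
    "SetGenericServiceRegistryCheckpoint(System.String, System.String) "
    "at Microsoft.Hpc.Scheduler.SchedulerCrypto+<InitKeyAndSalt>d__29.MoveNext() "
    "at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()"
)


def frames(trace):
    """Split a flattened .NET trace into frames, innermost first."""
    # Everything before the first " at " is the exception header, not a frame.
    return [part.strip() for part in trace.split(" at ")[1:]]


def shared_innermost(first, second):
    """Return the leading frames that both traces have in common."""
    shared = []
    for frame_a, frame_b in zip(frames(first), frames(second)):
        if frame_a != frame_b:
            break
        shared.append(frame_a)
    return shared


if __name__ == "__main__":
    for frame in shared_innermost(DIAG_TRACE, SCHED_TRACE):
        print(frame)
```

Both services throw the Win32Exception in the same helper, HAUtils.SetGenericServiceRegistryCheckpoint, which each one calls from its own key-and-salt initialization. That may help narrow down whether the failure lies in the crypto code itself or in the helper it calls.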

The certificate I use is the same on all three nodes from an AD-integrated CA and it is trusted by the nodes according to cert mgr.

To be honest, I have no idea how to proceed, because the exception information is not helpful at all.

