HPC Pack 2019 Cluster Manager cannot connect to newly installed cluster
I have been upgrading an on-premises HPC Pack 2016 cluster to 2019, which is basically a reinstall. The database is a completely new SQL Server instance, and the database scripts ran as expected. The installation completed without errors on three head nodes using the built-in clustering. However, I cannot connect with HPC Cluster Manager, presumably because the HPC services are constantly crashing. The errors I see in the application log (on all nodes) include:
Application: HpcDiagnostics.exe
Framework Version: v4.0.30319
Description: The process was terminated due to an unhandled exception.
Exception Info: System.ComponentModel.Win32Exception
   at Microsoft.ComputeCluster.Management.Win32Helpers.HAUtils.SetGenericServiceRegistryCheckpoint(System.String, System.String)
   at Microsoft.Hpc.Diagnostics.Store.DiagnosticCrypto+<GetKeyAndSalt>d__23.MoveNext()
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task)
   at Microsoft.Hpc.Diagnostics.Store.DiagnosticCrypto+<InitDefault>d__18.MoveNext()
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task)
   at Microsoft.Hpc.Diagnostics.Store.DiagnosticsStore+<Init>d__27.MoveNext()
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task)
   at Microsoft.Hpc.Diagnostics.DiagnosticsSvc+<StartSvc>d__8.MoveNext()
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task)
   at Microsoft.Hpc.DiagnosticsWinService.DiagnosticsWinService+<<OnStart>b__2_1>d.MoveNext()
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task)
   at Microsoft.Hpc.HighAvailabilityModule.Algorithm.MembershipWithWitness+<>c__DisplayClass45_0.<RunAsync>b__0(System.Object)
   at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
   at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
   at System.Threading.QueueUserWorkItemCallback.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem()
   at System.Threading.ThreadPoolWorkQueue.Dispatch()
and
Application: HpcScheduler.exe
Framework Version: v4.0.30319
Description: The process was terminated due to an unhandled exception.
Exception Info: System.ComponentModel.Win32Exception
   at Microsoft.ComputeCluster.Management.Win32Helpers.HAUtils.SetGenericServiceRegistryCheckpoint(System.String, System.String)
   at Microsoft.Hpc.Scheduler.SchedulerCrypto+<InitKeyAndSalt>d__29.MoveNext()
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task)
   at Microsoft.Hpc.Scheduler.SchedulerCrypto+<InitDefault>d__24.MoveNext()
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task)
   at Microsoft.Hpc.Scheduler.Store.SchedulerStoreInternal..ctor(Boolean, Boolean, System.String, System.String, System.String, System.String, System.Func`2<System.String,System.String>, System.Func`2<System.String,System.String>)
   at Microsoft.Hpc.Scheduler.Store.SchedulerStoreInternal..ctor(Boolean)
   at Microsoft.Hpc.Scheduler.SchedulerSvc+<StartSvc>d__19.MoveNext()
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task)
   at Microsoft.Hpc.Scheduler.SchedulerService+<<OnStart>b__5_1>d.MoveNext()
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(System.Threading.Tasks.Task)
   at Microsoft.Hpc.HighAvailabilityModule.Algorithm.MembershipWithWitness+<>c__DisplayClass45_0.<RunAsync>b__0(System.Object)
   at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
   at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
   at System.Threading.QueueUserWorkItemCallback.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem()
   at System.Threading.ThreadPoolWorkQueue.Dispatch()
and the SCM reports that
The HPC Job Scheduler Service service terminated unexpectedly. It has done this 2300 time(s). The following corrective action will be taken in 30000 milliseconds: Restart the service.
The HPC Diagnostics Service service terminated unexpectedly. It has done this 2203 time(s). The following corrective action will be taken in 30000 milliseconds: Restart the service.
The HPC Session Service service terminated unexpectedly. It has done this 1843 time(s). The following corrective action will be taken in 30000 milliseconds: Restart the service.
The HPC SDM Store Service service terminated unexpectedly. It has done this 1811 time(s). The following corrective action will be taken in 30000 milliseconds: Restart the service.
The HPC Monitoring Server Service service terminated unexpectedly. It has done this 1825 time(s). The following corrective action will be taken in 30000 milliseconds: Restart the service.
The HPC Reporting Service service terminated unexpectedly. It has done this 1839 time(s). The following corrective action will be taken in 30000 milliseconds: Restart the service.
The HPC Management Service service terminated unexpectedly. It has done this 1837 time(s). The following corrective action will be taken in 30000 milliseconds: Restart the service.
I have already tried a complete rebuild of the cluster (including the databases), to no avail. I suspect the issue might be related to TLS, because the call stacks pass through crypto initialization code (for example SchedulerCrypto+<InitKeyAndSalt> and DiagnosticCrypto+<GetKeyAndSalt>).
I use the same certificate, issued by an AD-integrated CA, on all three nodes, and according to the certificate manager it is trusted by the nodes.
To be honest, I have no idea how to proceed, because the exception information does not say which Win32 error occurred or why.
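To compare the two crashes, I pulled the innermost frame out of each event-log entry. The following Python snippet is only a rough sketch of how I did that: the `EVENTS` list holds abbreviated copies of the two entries quoted above (only the first two frames each), and the `summarize` helper is my own name, not part of HPC Pack.

```python
import re

# Abbreviated copies of the two application-log entries from this post.
# Only the first two frames are kept; the full traces are quoted above.
EVENTS = [
    "Application: HpcDiagnostics.exe Framework Version: v4.0.30319 "
    "Description: The process was terminated due to an unhandled exception. "
    "Exception Info: System.ComponentModel.Win32Exception "
    "at Microsoft.ComputeCluster.Management.Win32Helpers.HAUtils."
    "SetGenericServiceRegistryCheckpoint(System.String, System.String) "
    "at Microsoft.Hpc.Diagnostics.Store.DiagnosticCrypto+"
    "<GetKeyAndSalt>d__23.MoveNext()",
    "Application: HpcScheduler.exe Framework Version: v4.0.30319 "
    "Description: The process was terminated due to an unhandled exception. "
    "Exception Info: System.ComponentModel.Win32Exception "
    "at Microsoft.ComputeCluster.Management.Win32Helpers.HAUtils."
    "SetGenericServiceRegistryCheckpoint(System.String, System.String) "
    "at Microsoft.Hpc.Scheduler.SchedulerCrypto+"
    "<InitKeyAndSalt>d__29.MoveNext()",
]


def summarize(event_text):
    """Return (application, exception type, innermost frame) for one entry."""
    app = re.search(r"Application: (\S+)", event_text).group(1)
    exc = re.search(r"Exception Info: (\S+)", event_text).group(1)
    # Everything after "Exception Info:" is the exception type followed by
    # frames separated by " at "; the first frame is where it was thrown.
    trace = event_text.split("Exception Info:", 1)[1]
    frames = re.split(r"\s+at\s+", trace)[1:]
    innermost = frames[0].split("(", 1)[0]  # drop the parameter list
    return app, exc, innermost


if __name__ == "__main__":
    for entry in EVENTS:
        app, exc, frame = summarize(entry)
        print(f"{app}: {exc} thrown in {frame}")
```

For both processes the exception is thrown in the same frame, HAUtils.SetGenericServiceRegistryCheckpoint, which is reached from the respective crypto initialization (GetKeyAndSalt and InitKeyAndSalt). The event text does not include the native error code of the Win32Exception, so I cannot tell from the log alone which Win32 error it was.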