Question for the group:
Server 2019, configured as two 7-node Hyper-V clusters with SCVMM managing the environment. The front-end network uses 25 Gb connections with vSwitches built as Switch Embedded Teams, and there are two separate networks for iSCSI-connected CSVs using MPIO. Front-end switches are Cisco 100 Gb; the storage network is on Mellanox 100 Gb Ethernet switches. We also have a new 2-node Test/Dev cluster with the same hardware and configuration, built in case the event happened again, which it did this past weekend.
Over the last month we've had two events.
It appears that somewhere on the front-end network there is a pause or a drop (the switch logs don't indicate any issues), which shows up in the Microsoft-Windows-FailoverClustering-Diagnostics logs as missed heartbeats. In about 18 seconds, from the first missed heartbeat through the event to the first NTP time receipt, the clusters evicted their members, lost quorum, lost the CLUSDB file contents, and the cluster service stopped.
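To put the timing in context, here is a rough sketch of how I understand the heartbeat tolerance to work: tolerance is the heartbeat delay multiplied by the missed-heartbeat threshold. The 1000 ms delay and threshold of 10 are what I believe the same-subnet defaults to be on 2016/2019 (the SameSubnetDelay and SameSubnetThreshold cluster properties); treat them as assumptions and check your own cluster. The 12-second pause and the relaxed threshold of 20 are made-up illustration values, not measurements from our environment.

```python
def tolerance_seconds(delay_ms, threshold):
    """Seconds of missed heartbeats tolerated before a node is considered down."""
    return delay_ms * threshold / 1000.0


def survives_pause(pause_seconds, delay_ms, threshold):
    """True if a network pause of this length stays within the heartbeat tolerance."""
    return pause_seconds < tolerance_seconds(delay_ms, threshold)


if __name__ == "__main__":
    # Assumed same-subnet defaults; verify against your own cluster properties.
    assumed_delay_ms, assumed_threshold = 1000, 10
    # Hypothetical relaxed threshold, for comparison only.
    relaxed_threshold = 20
    # Hypothetical pause length; we have not measured the real one.
    pause = 12.0

    for label, threshold in (("assumed default", assumed_threshold),
                             ("relaxed example", relaxed_threshold)):
        print(label,
              "tolerance:", tolerance_seconds(assumed_delay_ms, threshold), "s,",
              "survives pause:", survives_pause(pause, assumed_delay_ms, threshold))
```

If those defaults are right, any pause longer than about 10 seconds would be enough to start removing nodes, which fits inside the 18-second window we see.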
The cluster has been set up with a Disk Witness and a File Share Witness, with the loss occurring both times.
In the first event we didn't have a backup of the cluster node, and thus no CLUSDB to recover, so we rebuilt both clusters.
The second time we had a copy of the CLUSDB, but the recovery process had not been tested beyond the basic steps: shut down all nodes, copy the CLUSDB to the last known cluster owner, stop the cluster service, unload the cluster registry hive, and replace the CLUSDB file. When we did that, the CLUSDB was overwritten by the Force Cluster Start used to recover the first node. A roughly 5 MB CLUSDB file dropped to 16 KB. In the second event we ended up rebuilding the clusters again.
In Server 2016 we would have seen the nodes drop service in Failover Cluster Manager (FCM) and the VMs go to an unmonitored state. In these two events there wasn't that same visual cue.
We are now reviewing the 2-node Test/Dev cluster, which experienced the same thing as the two production clusters. I pulled all logs from 3 minutes before the event to 7 minutes after it to find the 18-second window.
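For anyone who wants to do the same kind of slicing, this is a minimal sketch of the windowing step: keep only the log lines stamped between 3 minutes before and 7 minutes after a chosen event time. It assumes the logs have been exported to text and that each line carries a timestamp like `2024/01/13-02:14:07.512`; that layout, the function name, and the sample lines are illustrative assumptions, so adjust the pattern to whatever your export actually looks like.

```python
import re
from datetime import datetime, timedelta

# Assumed timestamp layout, e.g. 2024/01/13-02:14:07.512; adjust to your export.
TS_PATTERN = re.compile(r"(\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})")
TS_FORMAT = "%Y/%m/%d-%H:%M:%S.%f"


def lines_in_window(lines, event_time,
                    before=timedelta(minutes=3), after=timedelta(minutes=7)):
    """Return (timestamp, line) pairs inside the window around event_time, oldest first."""
    start, end = event_time - before, event_time + after
    selected = []
    for line in lines:
        match = TS_PATTERN.search(line)
        if not match:
            continue  # skip lines without a parseable timestamp
        stamp = datetime.strptime(match.group(1), TS_FORMAT)
        if start <= stamp <= end:
            selected.append((stamp, line.rstrip("\n")))
    return sorted(selected, key=lambda pair: pair[0])


if __name__ == "__main__":
    # Hypothetical sample lines, not taken from our real logs.
    sample = [
        "2024/01/13-02:05:00.000 INFO  routine entry well before the event",
        "2024/01/13-02:14:07.512 WARN  heartbeat missed",
        "2024/01/13-02:14:25.100 ERR   cluster service stopping",
        "2024/01/13-02:30:00.000 INFO  routine entry well after the event",
    ]
    event = datetime(2024, 1, 13, 2, 14, 7)
    for stamp, text in lines_in_window(sample, event):
        print(stamp.isoformat(), "|", text)
```

Merging the lines from every node into one list before calling it gives a single timeline, which is what made the 18-second window visible for us.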
Has anyone else had Hyper-V cluster events that lost the entire cluster configuration? Did you get a cause analysis from MS, or were you able to identify the cause yourselves?
Has anyone changed any cluster settings to increase timeouts?
What quorum configurations are you currently running for multi-node clustered Hyper-V on Server 2019?
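As context for the quorum question, here is a minimal sketch of the vote arithmetic for our layouts (7 nodes plus one witness vote, and 2 nodes plus one witness vote). It uses plain static majority only and deliberately ignores dynamic quorum and dynamic witness adjustments, so it is a simplification rather than how the cluster service actually decides; the function names are just for illustration.

```python
def votes_needed(total_votes):
    """Smallest number of votes that is a strict majority of total_votes."""
    return total_votes // 2 + 1


def has_quorum(nodes_up, total_nodes, witness_configured=True, witness_reachable=True):
    """Static-majority check; ignores dynamic quorum vote adjustments."""
    total = total_nodes + (1 if witness_configured else 0)
    present = nodes_up + (1 if witness_configured and witness_reachable else 0)
    return present >= votes_needed(total)


if __name__ == "__main__":
    # 7-node cluster with one witness vote: 8 votes in total.
    print("votes needed (7 nodes + witness):", votes_needed(8))
    for up in range(7, 2, -1):
        print(up, "nodes up, witness reachable:", has_quorum(up, 7))
    # 2-node Test/Dev cluster with one witness vote: 3 votes in total.
    print("1 of 2 nodes up, witness reachable:", has_quorum(1, 2))
    print("1 of 2 nodes up, witness unreachable:",
          has_quorum(1, 2, witness_reachable=False))
```

By this simple count a 7-node cluster should ride out three nodes dropping as long as the witness is reachable, which is part of why losing everything inside 18 seconds is so surprising to us.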
Basically it comes down to this: cluster communication drops or pauses, the nodes lose quorum and then the cluster settings, and then they have to be recovered. Not a good thing.
Background: we are well experienced with Hyper-V clustering on 2012 R2, 2016, and 2019. Appreciate any input on this. Thanks.