Server 2019 and Hyper-V Clusters

Dave Kreitel 1 Reputation point
2021-01-25T22:10:51.057+00:00

Question for the group:
We run Server 2019 in two 7-node Hyper-V clusters managed by SCVMM. The front-end network uses 25 Gb connections with vSwitches built as Switch Embedded Teams, and two separate networks carry the iSCSI-connected CSVs using MPIO. The front-end switches are Cisco 100 Gb; the storage network uses Mellanox Ethernet 100 Gb switches. We also built a new 2-node Test/Dev cluster with the same hardware and configuration in case the event happened again, which it did this past weekend.

Over the last month we've had 2 events.

It appears that somewhere on the front-end network there is a pause or a drop (the switch logs don't indicate any issues) that shows up in the Microsoft-Windows-FailoverClustering-Diagnostics logs as missed heartbeats. In about 18 seconds, from the initial missed-heartbeat identification through the event to the first NTP time receipt, the clusters evicted the members, lost quorum, lost the CLUSDB file contents, and the cluster service stopped.

The Cluster has been set up with a Disk Witness and a File Witness, with the loss occurring both times.
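
For readers following the quorum part of this, here is a minimal sketch of the basic vote arithmetic, assuming one vote per node plus one vote for a witness and a strict-majority rule. It deliberately ignores dynamic quorum and dynamic witness adjustments, and the function names are made up for illustration; it is not how the cluster service itself is implemented.

```python
def votes_needed(total_votes: int) -> int:
    """Smallest number of votes that forms a strict majority."""
    return total_votes // 2 + 1


def has_quorum(nodes_up: int, total_nodes: int, witness_up: bool,
               has_witness: bool = True) -> bool:
    """Static majority check: one vote per node, plus one for the witness."""
    total_votes = total_nodes + (1 if has_witness else 0)
    votes_present = nodes_up + (1 if has_witness and witness_up else 0)
    return votes_present >= votes_needed(total_votes)


if __name__ == "__main__":
    # 7 nodes + 1 witness = 8 votes in total.
    print(votes_needed(8))
    # 4 nodes still talking to each other plus the witness.
    print(has_quorum(nodes_up=4, total_nodes=7, witness_up=True))
    # 3 nodes plus the witness.
    print(has_quorum(nodes_up=3, total_nodes=7, witness_up=True))
```

Under this simplified model, a heartbeat interruption that isolates every node from every other node leaves no partition with a majority, which would be consistent with all members dropping at once.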

In the first event we didn't have a backup of the cluster node, and therefore no CLUSDB to recover, so we rebuilt both clusters.
The second time we had a copy of the CLUSDB, but the recovery process had not been tested, except for shutting down all nodes. The remaining steps: copy the CLUSDB to the last known cluster owner, stop the cluster service, unload the cluster registry hive, and replace the CLUSDB file. When we did that, the CLUSDB got overwritten by the Force Cluster Start used to recover the first node. A CLUSDB file of roughly 5 MB dropped to 16 KB. In the second event we ended up rebuilding the clusters again.

In Server 2016 we would have seen the nodes drop out of service in Failover Cluster Manager (FCM) and the VMs go to an unmonitored state. In these two events there wasn't that same visual clue.

We are now reviewing the 2-node Test/Dev cluster, which experienced the same thing as the two production clusters. I pulled all logs from 3 minutes before the event to 7 minutes after it to find the 18-second window.
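
As an illustration of that log-trimming step, here is a small sketch that keeps only records falling between 3 minutes before and 7 minutes after an event time. The record layout (a timestamp paired with a message) and the function name are assumptions for the example, not the format of any actual cluster log export.

```python
from datetime import datetime, timedelta


def in_window(records, event_time,
              before=timedelta(minutes=3), after=timedelta(minutes=7)):
    """Keep (timestamp, message) records within [event - before, event + after]."""
    start = event_time - before
    end = event_time + after
    return [rec for rec in records if start <= rec[0] <= end]


if __name__ == "__main__":
    # Hypothetical event time and records, for illustration only.
    event = datetime(2000, 1, 1, 12, 0, 0)
    records = [
        (event - timedelta(minutes=10), "unrelated earlier entry"),
        (event, "first missed heartbeat"),
        (event + timedelta(seconds=18), "cluster service stopped"),
        (event + timedelta(minutes=30), "unrelated later entry"),
    ]
    for ts, msg in in_window(records, event):
        print(ts.isoformat(), msg)
```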

Has anyone else had Hyper-V cluster events that lost the entire cluster configuration? Did you get a cause analysis from MS, or were you able to identify the cause yourself?
Has anyone changed any cluster settings to increase timeouts?
What quorum configurations are you currently using for multi-node clustered Hyper-V 2019?

Basically it comes down to this: cluster communication drops or pauses, the nodes lose quorum and then lose the cluster settings, and then they have to be recovered. Not a good thing.

Background: we are well experienced with Hyper-V clustering on 2012 R2, 2016, and 2019. I appreciate any input on this. Thanks.

60383-image.png

60363-image.png

Windows Server Clustering

2 answers

  1. Xiaowei He 9,916 Reputation points
    2021-01-26T06:44:01.957+00:00

    Hi,

    When we did that the CLUSDB got over written with the Force Cluster Start to recover the first node.

    1. I noticed you used the backup CLUSDB to recover the cluster; generally, I would not do that. If the cluster goes down due to heartbeat loss, then after the network recovers we can start one cluster node and bring the other nodes online one by one. The cluster DB will be updated and synced to all nodes once the cluster is back online.

    Has anyone changed any cluster settings to increase time outs?

    1. Based on my understanding, you may want to know how to increase the multi-site cluster heartbeat threshold. If so, please refer to the following article about how to change the heartbeat threshold:

    https://techcommunity.microsoft.com/t5/failover-clustering/tuning-failover-cluster-network-thresholds/ba-p/371834
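
To make the effect of those thresholds concrete, here is a rough sketch of the arithmetic, assuming the tolerance window is simply the heartbeat delay multiplied by the number of missed heartbeats allowed. `SameSubnetDelay` and `SameSubnetThreshold` are the cluster property names discussed in the linked article; the numbers passed in below are illustrative inputs only, not values read from any cluster in this thread.

```python
def heartbeat_tolerance_seconds(delay_ms: int, threshold: int) -> float:
    """Approximate time a node can miss heartbeats before being considered down.

    delay_ms:  interval between heartbeats in milliseconds (e.g. SameSubnetDelay)
    threshold: number of missed heartbeats tolerated (e.g. SameSubnetThreshold)
    """
    if delay_ms <= 0 or threshold <= 0:
        raise ValueError("delay_ms and threshold must be positive")
    return delay_ms * threshold / 1000.0


if __name__ == "__main__":
    # Illustrative values only; check the actual settings on your own cluster.
    for delay_ms, threshold in [(1000, 10), (1000, 20), (2000, 20)]:
        tolerance = heartbeat_tolerance_seconds(delay_ms, threshold)
        print(f"delay={delay_ms} ms, threshold={threshold}: {tolerance:.0f} s")
```

Comparing the computed window with the length of the observed network pause gives a first indication of whether raising the threshold would have ridden through the event.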

    What are your current quorum configurations that are being used running a multi-node Clustered Hyper-V 2019?

    Generally, we'd use a file share witness for a multi-site cluster.

    Thanks for your time!
    Best Regards,
    Anne


  2. Dave Kreitel 1 Reputation point
    2021-01-27T15:08:45.433+00:00

    XiaoweiHe, thank you for the response,

    Sure, I can understand shutting the cluster nodes down and then starting them one at a time. However, the CLUSDB file has to have the configuration contained within it. A current, fully functioning cluster of ours has a CLUSDB file size of 4608 KB. After the events the CLUSDB file is reduced to either 64 KB or 16 KB, which indicates that the file has been overwritten.
    These are two separate Hyper-V Clusters and not in a multi-site configuration.
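
The size comparison described here could be turned into a simple check, sketched below under stated assumptions: a known-good baseline size (4608 KB in this post) and an arbitrary cut-off ratio of 50%, both of which are illustrative choices. The function name is hypothetical, and the demo runs against a temporary file rather than a real CLUSDB.

```python
import os
import tempfile


def looks_truncated(path: str, baseline_bytes: int, min_ratio: float = 0.5) -> bool:
    """Flag a file whose size fell below min_ratio of a known-good baseline size."""
    return os.path.getsize(path) < baseline_bytes * min_ratio


if __name__ == "__main__":
    baseline = 4608 * 1024  # size of a healthy CLUSDB as reported in this post
    with tempfile.TemporaryDirectory() as tmp:
        # Stand-in file of 16 KB, mimicking the post-event size; not a real CLUSDB.
        sample = os.path.join(tmp, "CLUSDB.sample")
        with open(sample, "wb") as fh:
            fh.write(b"\0" * (16 * 1024))
        print(looks_truncated(sample, baseline))
```

Run on a schedule against a copy of the database file, a check like this would at least record when the size collapsed relative to the other events in the log.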

    We've pulled all logs from 3 minutes before the event to 5 minutes after the event. What we found were reports of network connections being disconnected.

    1. The first events are the "missed more than 40% of consecutive heartbeats" entries.
    2. Within 5 seconds the cluster has lost the UDP connection from the host to a different host on port 3343.
    3. Then the volumes go offline.
    4. At 17 seconds from the first event, all cluster nodes are removed from the active failover (this is where I'm assuming the CLUSDB file is being cleared on each node). There should be a copy on the Disk Quorum, but it is offline; it is managed by the cluster but is not a CSV.
    5. At 1 min 23 seconds from the first events, the hosts report: "The Cluster database could not be loaded. The file may be missing or corrupt."
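
A timeline like the one above can be derived from log timestamps with a small helper, sketched here. The timestamps are hypothetical and were chosen only to reproduce the offsets listed (5 s, 17 s, 1 min 23 s); the event layout and function name are assumptions for the example.

```python
from datetime import datetime, timedelta


def offsets_from_first(events):
    """Return (seconds_since_first_event, message) pairs in time order."""
    ordered = sorted(events, key=lambda e: e[0])
    first = ordered[0][0]
    return [((ts - first).total_seconds(), msg) for ts, msg in ordered]


if __name__ == "__main__":
    # Hypothetical base time; only the relative offsets matter.
    t0 = datetime(2000, 1, 1, 0, 0, 0)
    events = [
        (t0, "missed more than 40% of consecutive heartbeats"),
        (t0 + timedelta(seconds=5), "lost UDP connection on port 3343"),
        (t0 + timedelta(seconds=17), "all nodes removed"),
        (t0 + timedelta(seconds=83), "cluster database could not be loaded"),
    ]
    for seconds, message in offsets_from_first(events):
        print(f"+{seconds:5.0f} s  {message}")
```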

    I have an open ticket for a cause analysis.
    As I said earlier this has occurred multiple times and about 1 month apart.
    No activities on the network are being performed.
    There is nothing leading up to the issue that would point to a switch stopping internode communication.

    But what has changed in Server 2019 clustering that would cause a fully configured server to blow up on a loss of connectivity and delete the contents of the cluster database?

    Any help is greatly appreciated.

