Clustering: What exactly is a File Share Witness and when should I use one?

Customers ask from time to time: “What is a File Share Witness (FSW)?” Sometimes they’ve worked with prior versions of clustering and don’t know what a FSW is, or that the option exists. The next question is usually: “When should we use one?” Before going into that, I’ll review some subtle differences between the legacy cluster quorum options and the options available from Microsoft today for a Failover Cluster.

Legacy Cluster Quorum Options

Quorum may be defined as the number of members of a group that must be present to conduct business. For a legacy (Windows NT/2000) two-node cluster that lost all communications and became partitioned, whichever node could maintain reservation of the quorum disk first would survive and own it. The quorum disk was a tie-breaker for systems that could not communicate, as well as an additional storage location for cluster configuration data. One downside to this model was that if the quorum disk failed, so did the cluster: a legacy two-node cluster could not function without it, even if both nodes remained healthy. Therefore, the quorum disk was very important for legacy clusters.

For Windows Server 2003 clusters with more than two nodes, Majority Node Set (MNS) was another quorum option. A MNS cluster can only run when the majority of cluster nodes are available. This model is typically not chosen for two-node clusters because majority then requires both nodes ((2 / 2) + 1 = 2), so the cluster cannot survive the loss of either one. As of 2003 SP1, an option was added to allow use of a File Share Witness as an additional vote so that, in the same example, a two-node cluster could function with the loss of one node.

Quorum Options for Modern Clusters

With Windows Server 2008 and later clusters, if all network communication becomes severed between nodes and quorum must be established, votes are counted. (The exception is the disk-only quorum model, which is similar to the quorum model in legacy clusters in that the only vote counted is the disk. This option is not recommended and is rarely chosen.) By default, each node gets a vote and, if configured, a single witness may be counted as a vote. A witness may be either a Witness Disk or a File Share Witness (FSW), but you cannot use both. More than half of all possible votes (half, rounded down, plus one) must be present for there to be quorum. Therefore, for an even number of nodes you would want to have a FSW or witness disk so that the total number of votes is odd. It also means that the disk or FSW can cease to exist and, as long as enough nodes are online, the cluster may still function.
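The vote counting described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not how the cluster service is implemented; the names votes_needed and has_quorum and their parameters are made up for this example, and the disk-only model is deliberately left out.

```python
def votes_needed(total_votes: int) -> int:
    """Smallest vote count that is strictly more than half of all configured votes."""
    return total_votes // 2 + 1


def has_quorum(
    nodes_online: int,
    total_nodes: int,
    witness_configured: bool = False,
    witness_reachable: bool = False,
) -> bool:
    """Count one vote per online node plus one for a configured, reachable witness."""
    total_votes = total_nodes + (1 if witness_configured else 0)
    present = nodes_online + (1 if witness_configured and witness_reachable else 0)
    return present >= votes_needed(total_votes)


# Four nodes and no witness: half of the nodes is not a majority.
print(has_quorum(nodes_online=2, total_nodes=4))
# Four nodes plus a witness: two nodes that can reach the witness hold 3 of 5 votes.
print(has_quorum(2, 4, witness_configured=True, witness_reachable=True))
# The witness is gone, but three of the four nodes are online: still 3 of 5 votes.
print(has_quorum(3, 4, witness_configured=True, witness_reachable=False))
```

Under these assumptions the first call reports no quorum and the other two report quorum, which matches the reasoning for adding a witness to an even number of nodes.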

The following TechNet link is a great reference for 2008 and 2008 R2 quorum options:

https://technet.microsoft.com/en-us/library/cc770620(v=WS.10).aspx

From that link, the following table shows which quorum model is recommended depending on whether the cluster has an even or odd number of nodes.

Description of cluster | Quorum recommendation
Odd number of nodes | Node Majority
Even number of nodes (but not a multi-site cluster) | Node and Disk Majority
Even number of nodes, multi-site cluster | Node and File Share Majority
Even number of nodes, no shared storage | Node and File Share Majority
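The table can also be read as a simple decision rule. The sketch below encodes it in Python purely as an illustration; the function name recommend_quorum_model and its parameters are hypothetical and are not part of any Microsoft tool.

```python
def recommend_quorum_model(
    node_count: int,
    multi_site: bool = False,
    shared_storage: bool = True,
) -> str:
    """Return the quorum model the table recommends for a cluster description."""
    if node_count < 1:
        raise ValueError("a cluster needs at least one node")
    if node_count % 2 == 1:
        # Odd number of nodes: the nodes alone can always form a majority.
        return "Node Majority"
    if multi_site or not shared_storage:
        # Even number of nodes without a usable shared disk: use a file share.
        return "Node and File Share Majority"
    # Even number of nodes in one site with shared storage.
    return "Node and Disk Majority"


print(recommend_quorum_model(3))
print(recommend_quorum_model(4))
print(recommend_quorum_model(4, multi_site=True))
print(recommend_quorum_model(2, shared_storage=False))
```

The four calls correspond to the four rows of the table, in order.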

 

What is a FSW and how does it differ from a disk?

All this discussion about votes is a perfect segue into what a File Share Witness (FSW) is and how it differs from a witness disk. A FSW is simply a file share that you may create on a completely separate server from the cluster to act like a disk for tie-breaker scenarios when quorum needs to be established. The share could reside on a file server, a domain controller, or even a completely different cluster. If you are using the FSW option for quorum, the witness share needs to be available for a single connection, and all nodes of the cluster must be able to connect to it. The purpose of the FSW is to have something else that can count as a vote in situations where the number of configured nodes isn’t quite enough for determining quorum. A FSW is more likely to be used in multi-site clusters or where there is no common storage.

A FSW does not store cluster configuration data like a disk does. It does, however, contain information about which version of the cluster configuration database is most recent. Other than that, the FSW is just a share. Resources cannot fail over to it, nor can the share act as a communications hub or alternate brain to make decisions in the event cluster nodes cannot communicate.

Remember the old capture the flag game you might have played at summer camp? Typically each team had a flag and had to protect it from capture by the opposing team. However, a variation on that game is where there is only one flag located at an alternate location in the woods and two teams try to find it. When the flag is captured by one team, that team wins. A FSW is somewhat like the flag at an alternate location: the team of surviving nodes that obtains it is able to reach quorum, and the nodes that cannot obtain it drop out of the active cluster.

A Witness Disk is similar to a FSW, except that rather than being a file share somewhere on your network, it is an actual disk provided by storage that is common to the cluster. Nodes arbitrate for exclusive access to the disk, and the disk is capable of storing cluster configuration data. Like a FSW, a witness disk counts as a vote.

By this point you may have a good understanding of what a FSW is, what it isn’t, and when it might be used. Now let’s look at a couple of make-believe cluster configurations that each use a FSW and that look similar but differ in an important way.

 

The first configuration is a 4-node cluster split between two sites, with a FSW in Site C. In this configuration there are 5 possible votes. Bidirectional communication is possible on the Public and Private networks. Site A and Site B each have a link to Site C to connect to the FSW, but neither Site A nor Site B may connect with the other through Site C’s network. If one of the bidirectional networks fails, there remains one network that may be used for cluster communications so the nodes can determine connectivity and decide the best recovery method. If both bidirectional networks fail, then this cluster is partitioned into two sets of two nodes and the FSW must be leveraged to determine which set of nodes survives. The first set of nodes to successfully gain access to the FSW will survive. If the network to Site C were a complete network allowing communication between Site A and Site B as well, then there would be an alternate communication path for the cluster to determine the best course of action, and this configuration would be that much better.

 

The second configuration is very similar to the first and is something that customers have been known to implement. In this case, both bidirectional networks are actually VLANs that go through the same network connection. Therefore, the two separate networks have the vulnerability of a single network. Even if the Leased Net were redundant, it would still be provided by the same network provider or could be cut by a backhoe somewhere in between, so the possibility remains that the Leased Net segment could go down or become unresponsive. The validation process will warn if there is only a single network, but it has no insight into what the underlying network actually is between the two sites. An extra network or bidirectional communication link through Site C would once again be an improvement.

Without a FSW in these two configurations, there would be no way to break the tie during communication loss, because each set of two nodes would hold exactly half of the votes.
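The tie-break in these two configurations can be modeled with a small Python sketch. It only counts votes and assumes that every configured node is still running in one of the partitions; the real cluster service does considerably more when it arbitrates for the witness. The function name surviving_partition and its parameters are hypothetical.

```python
from typing import Dict, Optional


def surviving_partition(
    partitions: Dict[str, int],
    witness_configured: bool,
    witness_holder: Optional[str] = None,
) -> Optional[str]:
    """Return the name of the partition that keeps quorum, or None.

    partitions maps a partition name to the number of nodes in it.
    witness_holder names the partition that reached the witness first, if any.
    """
    total_votes = sum(partitions.values()) + (1 if witness_configured else 0)
    needed = total_votes // 2 + 1  # strictly more than half of all votes

    for name, node_votes in partitions.items():
        witness_vote = 1 if witness_configured and name == witness_holder else 0
        if node_votes + witness_vote >= needed:
            return name
    return None  # nobody holds a majority, so no partition keeps running


sites = {"Site A": 2, "Site B": 2}
print(surviving_partition(sites, witness_configured=True, witness_holder="Site A"))
print(surviving_partition(sites, witness_configured=False))
```

With the FSW, whichever site reaches the share first holds 3 of the 5 votes and survives, so the first call prints Site A. Without it, each site holds 2 of 4 votes, neither has a majority, and the second call prints None.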

Let’s take a non-technical example. Imagine you have 4 people on a camping trip. They park the truck at the camp parking lot and split into two groups of two. Each group hikes 5 miles from the truck in a different direction, but the groups end up at most 5 miles from each other. They each make camp, forming a triangular setup like the second diagram above. Each group has a cell phone to communicate (with voice and text capability) and a heater. The truck contains a spare heater with a partial tank of fuel. Late at night, the heater at each camp fails: one runs out of fuel, and the other experiences a mechanical problem. One or both groups are beyond cell range and cannot communicate. The first person to hike back to the truck gets the spare heater and has heat for the tent at their campsite. It doesn’t matter if both camps can hike to the truck or can even see the truck directly from their camp. There is only one spare heater. Since there is no communication available, they can’t call or text each other. The camp that ran out of fuel ends up getting the spare heater and almost enough fuel to make it through the night. The other camp wastes the energy of one person hiking to the truck and back only to find there is no spare heater, and ultimately freezes with no working heater and extra fuel they can’t use. Ideally, if the camps could communicate by other means, they would know to meet at the truck, swap the broken heater for the spare, and divide up all the fuel so that both camps could have heat, or decide to leave and get coffee on the way home. With only one real communication path, the best decision for all 4 people could not be made.

The cluster configuration is similar. With all communication severed, the cluster has to decide what to do based on the information available, which may be limited. The FSW in the cluster example is capable of breaking the tie of two votes against two votes. However, the cluster cannot take other conditions into account to discern which site is actually the best one to continue.

How do I know if I’ve chosen the best quorum model?

For a cluster with an odd number of nodes, Node Majority is the typical model chosen. However, with an even number of nodes it makes sense to have a witness resource as a vote. The validation process for clusters in Windows Server 2008 R2 can validate the cluster against best practices and suggest a different quorum model if the current selection is not the best option.