Part 2: Datacenter Activation Coordination and the File Share Witness

In Part 1 of this series, I provided a high level overview of how Datacenter Activation Coordination (DAC) mode works and how the database mount process on startup is affected when DAC mode is enabled.  (https://blogs.technet.com/b/timmcmic/archive/2012/05/21/my-databases-do-not-automatically-mount-after-i-enabled-datacenter-activation-coordination.aspx)

 

Remember, with DAC mode enabled, different rules apply for mounting databases on startup. The starting DAG member must be able to participate in a cluster that has quorum, and it must be able to communicate with another DAG member that has a DACP value of 1 or be able to communicate with all DAG members listed on the StartedMailboxServers list.

 

When a datacenter switchover has been performed and DAC mode is enabled, there could exist a condition where the standby datacenter contains a single DAG member (and therefore a single node cluster). This raises two interesting conditions:

 

  • A single node cluster always has quorum
  • There could be a single server on the started servers list

 

If the primary datacenter were restored to service without connectivity to the standby datacenter, this configuration could result in a split brain condition. To protect against this, we use an independent arbitrator to assist in determining the DACP bit: the boot time of the witness server.

 

When DAC mode is enabled, a DAG member will record two values in the registry at HKEY_LOCAL_MACHINE\Software\Microsoft\ExchangeServer\v14\Replay\Parameters:

 

  • BootTimeCookie = the boot time of the DAG member
  • BootTimeFSWCookie = the boot time of the witness server (which we obtain using WMI)

 

When DAC mode is enabled, there are different mount-on-startup rules that apply when only a single DAG member remains:

 

  • If the bootTimeCookie equals the boot time of the DAG member <and> the bootTimeFSWCookie does not equal the boot time of the witness server, then the DACP bit is set to 1.
  • If the bootTimeFSWCookie equals the boot time of the witness server <and> the bootTimeCookie does not equal the boot time of the DAG member, then the DACP bit is set to 1.
  • If the bootTimeFSWCookie equals the boot time of the witness server <and> the bootTimeCookie equals the boot time of the DAG member, then the DACP bit is set to 1.
  • If the bootTimeFSWCookie is not equal to the boot time of the witness server <and> the bootTimeCookie is not equal to the boot time of the DAG member, then the DACP bit remains at 0.

 

In the following examples, a two-member DAG was configured, and a datacenter switchover was performed resulting in a single-node cluster. The specific test, with tracing data, is provided for each example.

 

Example #1

 

In this example, the Microsoft Exchange Replication service on the single surviving node is restarted. Neither the DAG member itself nor the witness server was restarted. The bootTimeFSWCookie equals the boot time of the witness server <and> the bootTimeCookie equals the boot time of the DAG member, resulting in a DACP bit of 1.

 

438 00000000 5264 Cluster.Replay ActiveManager GetBootTimeWithWmi: WMI says that the boot time for dc.exchange.msft is 05/27/2012 16:28:12.
439 00000000 5264 Cluster.Replay ActiveManager DetermineAutomountConsensus: checking if the replay service has restarted since the MommyMayIMount bit was set.
455 00000000 5264 Cluster.Replay ActiveManager GetBootTimeWithWmi: WMI says that the boot time for MBX-2.exchange.msft is 05/27/2012 18:02:49.
456 00000000 5264 Cluster.Replay ActiveManager DetermineAutomountConsensus: WMI says the boot time is 05/27/2012 18:02:49, and the boot time when the Mount bit was set was 05/27/2012 18:02:49.
457 00000000 5264 Cluster.Replay ActiveManager DetermineAutomountConsensus found matching boot timestamps, assuming that the replay service has restarted since setting the bit.
458 00000000 5264 Cluster.Replay ActiveManager AllowAutoMount called: Found matching boot timestamps, assuming that the replay service has restarted since setting the bit.
460 00000000 5264 Cluster.Replay ActiveManager RefreshConfigInternal: The Automount consensus is true.

Example #2

 

In this example, the remaining DAG member was restarted and the witness server remained running. The bootTimeFSWCookie equals the boot time of the witness server <and> the bootTimeCookie does not equal the boot time of the DAG member, resulting in a DACP bit of 1.

 

85 00000000 2996 Cluster.Replay ActiveManager GetBootTimeWithWmi: WMI says that the boot time for dc.exchange.msft is 05/27/2012 16:28:12.
86 00000000 2996 Cluster.Replay ActiveManager DetermineAutomountConsensus: checking if the replay service has restarted since the MommyMayIMount bit was set.
87 00000000 2996 Cluster.Replay ActiveManager GetBootTimeWithWmi: WMI says that the boot time for MBX-2.exchange.msft is 05/27/2012 19:11:49.
88 00000000 2996 Cluster.Replay ActiveManager DetermineAutomountConsensus: WMI says the boot time is 05/27/2012 19:11:49, and the boot time when the Mount bit was set was 05/27/2012 18:58:31.
89 00000000 2996 Cluster.Replay ActiveManager DetermineAutomountConsensusUnanimity: There is only one node in the cluster -- this is not sufficient to allow mounts!
90 00000000 2996 Cluster.Replay ActiveManager DetermineAutomountConsensusForSingleMachine: checking if the file share witness has restarted since the MommyMayIMount bit was set.
91 00000000 2996 Cluster.Replay ActiveManager DetermineAutomountConsensusForSingleMachine: WMI says the boot time for the FSW server is 05/27/2012 16:28:12, and the boot time when the Mount bit was set was 05/27/2012 16:28:12.
92 00000000 2996 Cluster.Replay ActiveManager DetermineAutomountConsensusForSingleMachine found matching boot timestamps, assuming that only this computer has restarted since setting the bit.

93 00000000 2996 Cluster.Replay ActiveManager AllowAutoMount called: Found matching FSW boot timestamps, assuming that only this computer has restarted since setting the bit.
94 00000000 2996 Cluster.Replay ActiveManager GetBootTimeWithWmi: WMI says that the boot time for MBX-2.exchange.msft is 05/27/2012 19:11:49.
95 00000000 2996 Cluster.Replay ActiveManager DetermineAutomountConsensusUnanimity is returning True.
96 00000000 2996 Cluster.Replay ActiveManager RefreshConfigInternal: The Automount consensus is true.

Example #3

 

In this example, the witness server was rebooted and the Microsoft Exchange Replication service on the DAG member was restarted. The bootTimeFSWCookie does not equal the boot time of the witness server <and> the bootTimeCookie does equal the boot time of the DAG member resulting, in a DACP bit of 1.

 

263 00000000 1552 Cluster.Replay ActiveManager GetBootTimeWithWmi: WMI says that the boot time for dc.exchange.msft is 05/27/2012 19:36:51.
264 00000000 1552 Cluster.Replay ActiveManager DetermineAutomountConsensus: checking if the replay service has restarted since the MommyMayIMount bit was set.
265 00000000 1552 Cluster.Replay ActiveManager GetBootTimeWithWmi: WMI says that the boot time for MBX-2.exchange.msft is 05/27/2012 19:27:30.

266 00000000 1552 Cluster.Replay ActiveManager DetermineAutomountConsensus: WMI says the boot time is 05/27/2012 19:27:30, and the boot time when the Mount bit was set was 05/27/2012 19:27:30. 267 00000000 1552 Cluster.Replay ActiveManager DetermineAutomountConsensus found matching boot timestamps, assuming that the replay service has restarted since setting the bit.
268 00000000 1552 Cluster.Replay ActiveManager AllowAutoMount called: Found matching boot timestamps, assuming that the replay service has restarted since setting the bit.
269 00000000 1552 Cluster.Replay ActiveManager GetBootTimeWithWmi: WMI says that the boot time for MBX-2.exchange.msft is 05/27/2012 19:27:30.
270 00000000 1552 Cluster.Replay ActiveManager RefreshConfigInternal: The Automount consensus is true.

Example #4

 

In this last example, both the witness server and the remaining single DAG member were restarted. Thus, the bootTimeFSWCookie does equal the boot time of the witness server <and> the bootTimeCookie does not equal the boot time of the remaining DAG member. As such, the DACP bit remains at 0.

 

76 00000000 3664 Cluster.Replay ActiveManager GetBootTimeWithWmi: WMI says that the boot time for dc.exchange.msft is 05/27/2012 19:47:49.
77 00000000 3664 Cluster.Replay ActiveManager DetermineAutomountConsensus: checking if the replay service has restarted since the MommyMayIMount bit was set.
78 00000000 3664 Cluster.Replay ActiveManager GetBootTimeWithWmi: WMI says that the boot time for MBX-2.exchange.msft is 05/27/2012 19:55:40.
79 00000000 3664 Cluster.Replay ActiveManager DetermineAutomountConsensus: WMI says the boot time is 05/27/2012 19:55:40, and the boot time when the Mount bit was set was 05/27/2012 19:27:30.
80 00000000 3664 Cluster.Replay ActiveManager DetermineAutomountConsensusUnanimity: There is only one node in the cluster -- this is not sufficient to allow mounts!
81 00000000 3664 Cluster.Replay ActiveManager DetermineAutomountConsensusForSingleMachine: checking if the file share witness has restarted since the MommyMayIMount bit was set.
82 00000000 3664 Cluster.Replay ActiveManager DetermineAutomountConsensusForSingleMachine: WMI says the boot time for the FSW server is 05/27/2012 19:47:49, and the boot time when the Mount bit was set was 05/27/2012 19:36:51.
83 00000000 3664 Cluster.Replay ActiveManager DetermineAutomountConsensusUnanimity is returning False.
84 00000000 3664 Cluster.Replay ActiveManager Automount consensus not reached, going to Unknown AM role.

 

When performing a datacenter switchover where only a single node remains in the cluster supporting the DAG, any reboot that changes both the boot time of the witness server and the boot time of the DAG member will prevent databases from mounting automatically. If the reboots were necessary and valid operations, administrators can force the databases online without causing split brain.

 

========================================================

Datacenter Activation Coordination Series:

 

Part 1:  My databases do not mount automatically after I enabled Datacenter Activation Coordination (https://aka.ms/F6k65e)
Part 2:  Datacenter Activation Coordination and the File Share Witness (https://aka.ms/Wsesft)
Part 3:  Datacenter Activation Coordination and the Single Node Cluster (https://aka.ms/N3ktdy)
Part 4:  Datacenter Activation Coordination and the Prevention of Split Brain (https://aka.ms/C13ptq)
Part 5:  Datacenter Activation Coordination:  How do I Force Automount Concensus? (https://aka.ms/T5sgqa)
Part 6:  Datacenter Activation Coordination:  Who has a say?  (https://aka.ms/W51h6n)
Part 7:  Datacenter Activation Coordination:  When to run start-databaseavailabilitygroup to bring members back into the DAG after a datacenter switchover.  (https://aka.ms/Oieqqp)
Part 8:  Datacenter Activation Coordination:  Stop!  In the Name of DAG... (https://aka.ms/Uzogbq)
Part 9:  Datacenter Activation Coordination:  An error cause a change in the current set of domain controllers (https://aka.ms/Qlt035)

========================================================

Comments

  • Anonymous
    January 01, 2003
    @Rika....

    It's very unlikely that a split brain condition occurred. For this to happen - you'd have to have quorum in both sides as a pre-requisite.

    TIMMCMIC

  • Anonymous
    January 01, 2003
    Hi Tim, Great blog again.. Actually, my question is the same as "Ian_N" is asking. Basically, how or how often the BootTimeFSWCookie and BootTimeCookie get updated? In example 3, why replication service also restarted? This made me curious that do I really need to restart replication service on last standing DAG member server after rebooting fsw. Because, assuming BootTimeFSWCookie is not updated without restarting replication service, BootTimeFSWCookie remains 05/27/2012 16:28:12. And if DAG member is rebooted AFTER example 3 ( fsw was rebooted), both BootTimeCookie and BootTimeFSWCookie will not match and Mommy will say "no" when asking MommyMayIMount. I performed a quick test. In my test, when only fsw was rebooted, after a while BootTimeFSWCookie was updated (without restarting replication service) on DAG member. In normal operation, when both data centers are up, BootTimeFSWCookie reg value is updated on all DAG members after rebooting Fsw. Thanks, Neothwin

  • Anonymous
    January 01, 2003
    @ian_n and @Neothwin I know this is a little late to reply, but I believe those values are updated every time the MommyMayIMount bit is successfully set and they are updated with whatever WMI returns as the boot time.  Hopefully someone will correct me if I'm wrong.

  • Anonymous
    January 01, 2003
    When does the DAG member or FSW actually record the BoottimeCookie or BootTimeFSWToCookie value? Wouldn't the boot time values always be the same? There has to be a difference in time as to when the value is checked and when it is updated.  

  • Anonymous
    January 01, 2003
    @enid...

    I'm not sure - things look the same to me.

    TIMMCMIC

  • Anonymous
    January 01, 2003
    @Talha: Actually DAC most likely does not have anything to do with your issue if connectivity is lost / power outage.  The question is where do you have a quorum of your nodes.  In your instance you'll probably want to have a quorum of your nodes available in the remote data center until you can correct the power condition.  This should allow databases to failover, come online in the remote site, and remain online.  When the primary comes back online they will eventually resync - clients would have to have access across the WAN. If you are saying that you do not want databases online in the remote and you continue to have these power outages then this could be DAC.  Having DAC disabled would / should allow the databases to come online in the primary as long as a quorum of nodes exists. TIMMCMIC

  • Anonymous
    August 09, 2012
    The comment has been removed

  • Anonymous
    November 27, 2013
    The comment has been removed

  • Anonymous
    August 27, 2014
    how come that we experience Split Brain on Ex2013 that have DAGONLY configured?, we have one DAG with 6 nodes and a FSW in a third site. 3 nodes on each site. half of the DB active on Site-A and half on Site-B. suddenly all DB were mounted on each Site (same DB were mounted on each).

  • Anonymous
    August 28, 2014
    Hi, It seems a little difference of Exchange 2013: DAC mode for DAGs with two members
    http://technet.microsoft.com/en-us/library/dd979790(v=exchg.150).aspx