Exchange 2013 DAG: Datacenter Failover and Disaster Recovery
Many articles on the internet cover datacenter failover and site resilience for Exchange 2013. This short note summarizes what you need during a failover, so you do not have to spend two or three hours reading to get the concepts.
Exchange 2013 Terminology
An Exchange administrator should know the following terms for their environment:
Primary Active Manager (PAM) runs inside the Microsoft Exchange Replication service on one DAG member. It detects and reacts to server failures, owns the cluster quorum resource, and holds the information about which database copies are active, passive and mounted.
Standby Active Manager (SAM) runs on the other DAG members and tells the Client Access and Transport services which server hosts the active copy of a mailbox database.
Datacenter Activation Coordination (DAC) mode uses the Datacenter Activation Coordination Protocol (DACP) to avoid split brain. When a server in a DAC-mode DAG reboots, Active Manager starts with its DACP bit set to 0, which means it may not mount databases. It then contacts the other DAG members; once a member responds with its bit set to 1, the server sets its own bit to 1 and is allowed to mount databases.
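The DACP check described above can be modelled in a few lines. This is only an illustrative sketch, not an Exchange API: the function name Test-DacpMountAllowed and its parameters are hypothetical, and it assumes that a starting member may also set its bit to 1 when it can reach every server in the DAG's started list.

```powershell
# Hypothetical model of the DACP decision made by a starting DAG member.
# Not an Exchange cmdlet; names and parameters are illustrative only.
function Test-DacpMountAllowed {
    param(
        # DACP bits (0 or 1) returned by the other members that responded.
        [int[]]$PeerBits = @(),
        # Number of servers in the started list, including the local server.
        [int]$StartedMemberCount = 1
    )

    # The local bit always starts at 0 after a reboot.
    $localBit = 0

    if ($PeerBits -contains 1) {
        # Another member is already allowed to mount, so this one is too.
        $localBit = 1
    }
    elseif (($PeerBits.Count + 1) -eq $StartedMemberCount) {
        # Assumption: reaching every started member also permits mounting.
        $localBit = 1
    }

    return ($localBit -eq 1)
}

# Two of three peers answered, one of them with bit 1: mounting is allowed.
Test-DacpMountAllowed -PeerBits @(0, 1) -StartedMemberCount 4
```

The point of the model is the failure case: a server that comes back in an isolated site and only hears from peers whose bit is 0 stays dismounted, which is what prevents split brain.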
Quorum Details
Odd number of nodes ---> Node Majority
Even number of nodes (but not a multi-site cluster) ---> Node and Disk Majority
Even number of nodes, multi-site cluster ---> Node and File Share Majority
Even number of nodes, no shared storage ---> Node and File Share Majority
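The selection rules above can be written as a small helper. This is a sketch that only encodes the list as stated; the function name Get-RecommendedQuorumModel and its parameters are hypothetical and are not part of Exchange or the failover clustering module.

```powershell
# Hypothetical helper that encodes the quorum selection rules listed above.
function Get-RecommendedQuorumModel {
    param(
        [Parameter(Mandatory = $true)][int]$NodeCount,
        [switch]$MultiSite,      # cluster spans more than one site
        [switch]$SharedStorage   # cluster has shared disks available
    )

    if ($NodeCount -lt 1) { throw 'NodeCount must be at least 1.' }

    # Odd number of nodes: the nodes alone can form a majority.
    if ($NodeCount % 2 -eq 1) { return 'Node Majority' }

    # Even number of nodes: a witness breaks the tie.
    if ($MultiSite -or -not $SharedStorage) {
        return 'Node and File Share Majority'
    }
    return 'Node and Disk Majority'
}

Get-RecommendedQuorumModel -NodeCount 3              # Node Majority
Get-RecommendedQuorumModel -NodeCount 4 -MultiSite   # Node and File Share Majority
```

Because DAG members do not use shared storage, a DAG with an even number of members always ends up on Node and File Share Majority.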
Continuous replication starts in file mode, in which each closed 1 MB transaction log file is copied to the passive database copy. Once the passive copy has caught up, replication switches to block mode, in which log data is sent to the passive copy as it is written.
Cluster nodes listen on port 3343 for incoming connections from the other DAG members.
Knowing the definitions is enough for now; let us move on to what we do in practice in our Exchange infrastructure. It is always good to document the component information below, because it helps if our servers are hit by a disaster.
Verification of Exchange 2013 DAG Components:
Primary Active Manager:
To verify the PAM
Get-DatabaseAvailabilityGroup <DAGName> -Status | FL Name, PrimaryActiveManager
To move the PAM to a different DAG member
cluster group "Cluster Group" /MoveTo:<DAGServerName>
AutoDatabaseMountDial:
Get-MailboxServer <MailboxServerName> | FL Name, AutoDatabaseMountDial
BestAvailability - copy queue length ≤ 12 logs
GoodAvailability (default) - copy queue length ≤ 6 logs
Lossless - copy queue length of 0 logs
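To make the thresholds concrete, the sketch below checks whether a copy would be allowed to mount automatically for a given dial setting and copy queue length. Test-AutoMountAllowed is a hypothetical helper written for this note, not an Exchange cmdlet, and it models only the copy queue length limits listed above.

```powershell
# Hypothetical helper: compares a copy queue length with the limit that
# each AutoDatabaseMountDial setting allows.
function Test-AutoMountAllowed {
    param(
        [Parameter(Mandatory = $true)]
        [ValidateSet('BestAvailability', 'GoodAvailability', 'Lossless')]
        [string]$AutoDatabaseMountDial,

        [Parameter(Mandatory = $true)]
        [int]$CopyQueueLength
    )

    # Maximum number of missing logs tolerated by each setting.
    $maxLogs = @{
        BestAvailability = 12
        GoodAvailability = 6
        Lossless         = 0
    }

    return ($CopyQueueLength -le $maxLogs[$AutoDatabaseMountDial])
}

Test-AutoMountAllowed -AutoDatabaseMountDial GoodAvailability -CopyQueueLength 8   # False
Test-AutoMountAllowed -AutoDatabaseMountDial BestAvailability -CopyQueueLength 8   # True
```

The CopyQueueLength value for a real copy can be read from Get-MailboxDatabaseCopyStatus.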
Datacenter Activation Coordination (DAC)
Get-DatabaseAvailabilityGroup -Identity <DAGName> | FL Name, DatacenterActivationMode
To verify Quorum
cluster /quorum
To verify Continuous Replication Mode
Get-Counter -ComputerName <ServerName> -Counter "\MSExchange Replication(*)\Continuous replication - block mode Active"
To check the replication network
Get-MailboxDatabaseCopyStatus -Server <ServerName> -ConnectionStatus | FL Name, IncomingLogCopyingNetwork, SeedingNetwork
To check the DAG network configuration
Get-DatabaseAvailabilityGroup | FL Name, ManualDagNetworkConfiguration
Check the Exchange server location in AD site
Get-ExchangeServer -Identity <ServerName> -Status | FL Name, Site
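Since the goal of these checks is documentation, they can be collected into one text file. The following is a minimal sketch meant to be run from the Exchange Management Shell; the function name Export-DagSnapshot and the output path are hypothetical, and the property lists can be trimmed or extended to suit your environment.

```powershell
# Hypothetical helper: writes the DAG details checked above to a text file.
# Assumes it runs in the Exchange Management Shell.
function Export-DagSnapshot {
    param(
        [Parameter(Mandatory = $true)][string]$DagName,
        [Parameter(Mandatory = $true)][string]$OutFile
    )

    $dag = Get-DatabaseAvailabilityGroup -Identity $DagName -Status

    # Collect every section as plain text so it can be written in one pass.
    $report = @("=== DAG $DagName ===")
    $report += ($dag | Format-List Name, PrimaryActiveManager,
        DatacenterActivationMode, WitnessServer, AlternateWitnessServer,
        ManualDagNetworkConfiguration, StartedMailboxServers,
        StoppedMailboxServers | Out-String)

    foreach ($member in $dag.Servers) {
        $name = $member.Name
        $report += "=== Server $name ==="
        $report += (Get-MailboxServer -Identity $name |
            Format-List Name, AutoDatabaseMountDial | Out-String)
        $report += (Get-ExchangeServer -Identity $name |
            Format-List Name, Site | Out-String)
        $report += (Get-MailboxDatabaseCopyStatus -Server $name -ConnectionStatus |
            Format-List Name, Status, CopyQueueLength,
            IncomingLogCopyingNetwork, SeedingNetwork | Out-String)
    }

    $report | Out-File -FilePath $OutFile
}

# Example call; the DAG name and path are placeholders.
# Export-DagSnapshot -DagName 'DAG01' -OutFile 'C:\Temp\DAG01-snapshot.txt'
```

Keeping a dated copy of this file outside the Exchange servers gives you the witness, site and mount dial settings even when the DAG itself is unreachable.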
Datacenter Switchover
When the primary site, which holds the majority of the DAG nodes, fails because of a disaster such as a power outage or server failure, follow the steps below.
- Verify the started and stopped servers in the DAG
Get-DatabaseAvailabilityGroup <DAGName> -Status | FL Name, *Servers
- Use Stop-DatabaseAvailabilityGroup to mark the primary-site DAG members as stopped. If the primary-site servers are unreachable, add the -ConfigurationOnly switch so that only Active Directory is updated.
Stop-DatabaseAvailabilityGroup -Identity <DAGName> -ActiveDirectorySite PrimarySite
- Verify the started and stopped servers in the DAG again
Get-DatabaseAvailabilityGroup <DAGName> -Status | FL Name, *Servers
- Stop the Cluster service on every DAG member in the secondary site
Stop-Service clussvc
- Use Restore-DatabaseAvailabilityGroup to evict the stopped Mailbox servers from the cluster and re-establish quorum using the alternate witness server
Restore-DatabaseAvailabilityGroup <DAGName> -ActiveDirectorySite DR
- When service or power is restored in the primary site, run Start-DatabaseAvailabilityGroup to bring the primary-site members back into the DAG
Start-DatabaseAvailabilityGroup <DAGName> -ActiveDirectorySite PrimarySite
- Check the quorum model
Get-ClusterQuorum | fl
- If it still shows the old quorum model, run the cmdlet below without further parameters so that the quorum model is re-evaluated
Set-DatabaseAvailabilityGroup -Identity <DAGName>
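The switchover part of the steps above (up to Restore-DatabaseAvailabilityGroup) can be sketched as a single function. Treat it as a starting point rather than a tested runbook: Invoke-DagSiteSwitchover and its parameter names are hypothetical, it assumes the failed site's servers are unreachable (hence -ConfigurationOnly), and it assumes PowerShell remoting is available to the surviving servers.

```powershell
# Hypothetical wrapper around the switchover steps. Run it from the
# Exchange Management Shell on a server in the surviving site.
function Invoke-DagSiteSwitchover {
    param(
        [Parameter(Mandatory = $true)][string]$DagName,
        [Parameter(Mandatory = $true)][string]$FailedSite,
        [Parameter(Mandatory = $true)][string]$RecoverySite,
        [Parameter(Mandatory = $true)][string[]]$RecoverySiteServers
    )

    # 1. Mark the failed site's members as stopped. -ConfigurationOnly
    #    updates Active Directory only, because the servers are assumed down.
    Stop-DatabaseAvailabilityGroup -Identity $DagName `
        -ActiveDirectorySite $FailedSite -ConfigurationOnly

    # 2. Stop the Cluster service on each surviving DAG member.
    foreach ($server in $RecoverySiteServers) {
        Invoke-Command -ComputerName $server -ScriptBlock {
            Stop-Service -Name ClusSvc
        }
    }

    # 3. Evict the stopped members and re-establish quorum in the
    #    recovery site using the alternate witness server.
    Restore-DatabaseAvailabilityGroup -Identity $DagName `
        -ActiveDirectorySite $RecoverySite

    # 4. Show the resulting started and stopped server lists.
    Get-DatabaseAvailabilityGroup -Identity $DagName -Status |
        Format-List Name, StartedMailboxServers, StoppedMailboxServers
}

# Example call; all names are placeholders.
# Invoke-DagSiteSwitchover -DagName 'DAG01' -FailedSite 'PrimarySite' `
#     -RecoverySite 'DR' -RecoverySiteServers 'MBX3', 'MBX4'
```

The confirmation prompts of the Exchange cmdlets are deliberately left in place, so each destructive step still needs an explicit answer during a real switchover.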