Inconsistent results when enabling standby continuous replication (SCR) in Exchange 2007 SP1.
A common question I receive from customers is why they see inconsistent results when they enable a storage group for standby continuous replication in Exchange 2007 SP1. Usually the conversation focuses on why replication instances initially show as failed and then become healthy soon after, or why the replication service reports that databases are not configured for standby continuous replication even though the command ran successfully.
Standby continuous replication was introduced in Exchange 2007 SP1 as a way to replicate databases from any mailbox role source to an independent mailbox role target. Most commonly customers implement this technology as part of a broader site resiliency plan. Information regarding standby continuous replication can be found here: https://technet.microsoft.com/en-us/library/bb676502.aspx.
The command used to enable standby continuous replication is enable-storagegroupcopy -identity <storagegroup> -standbymachine <target>. More information on this cmdlet can be found here: https://technet.microsoft.com/en-us/library/bb123684.aspx.
When the enable-storagegroupcopy command is run, an attribute on the storage group is updated in Active Directory. The attribute is msExchStandbyCopyMachines. It is multi-valued, reflecting that a database can be replicated to multiple SCR targets. When the command runs successfully, the target name is written to the attribute, along with values representing TruncationLagTime and ReplayLagTime. Here is a sample LDP dump of a storage group enabled for SCR.
========================================
Expanding base 'CN=2008-MBX1-SG1,CN=InformationStore,CN=2008-MBX1,CN=Servers,CN=Exchange Administrative Group (FYDIBOHF23SPDLT),CN=Administrative Groups,CN=Exchange,CN=Microsoft Exchange,CN=Services,CN=Configuration,DC=exchange,DC=msft'...
Getting 1 entries:
Dn: CN=2008-MBX1-SG1,CN=InformationStore,CN=2008-MBX1,CN=Servers,CN=Exchange Administrative Group (FYDIBOHF23SPDLT),CN=Administrative Groups,CN=Exchange,CN=Microsoft Exchange,CN=Services,CN=Configuration,DC=exchange,DC=msft
cn: 2008-MBX1-SG1;
distinguishedName: CN=2008-MBX1-SG1,CN=InformationStore,CN=2008-MBX1,CN=Servers,CN=Exchange Administrative Group (FYDIBOHF23SPDLT),CN=Administrative Groups,CN=Exchange,CN=Microsoft Exchange,CN=Services,CN=Configuration,DC=exchange,DC=msft;
dSCorePropagationData: 0x0 = ( );
instanceType: 0x4 = ( WRITE );
msExchESEParamBaseName: E00;
msExchESEParamCheckpointDepthMax: 20971520;
msExchESEParamCircularLog: 0;
msExchESEParamCommitDefault: 0;
msExchESEParamCopyLogFilePath: f:\2008-MBX1\2008-MBX1-SG1-Logs-LCR;
msExchESEParamCopySystemPath: e:\2008-MBX1\2008-MBX1-SG1-System-LCR;
msExchESEParamDbExtensionSize: 256;
msExchESEParamEnableIndexChecking: TRUE;
msExchESEParamEnableOnlineDefrag: TRUE;
msExchESEParamEventSource: MSExchangeIS;
msExchESEParamLogFilePath: d:\2008-MBX1\2008-MBX1-SG1-Logs;
msExchESEParamLogFileSize: 1024;
msExchESEParamPageFragment: 8;
msExchESEParamPageTempDBMin: 0;
msExchESEParamSystemPath: d:\2008-MBX1\2008-MBX1-SG1-System;
msExchESEParamZeroDatabaseDuringBackup: 0;
msExchHasLocalCopy: 1;
msExchMinAdminVersion: -2147453113;
msExchStandbyCopyMachines: 2008-MBX2.exchange.msft;1;1.00:00:00;00:00:00;
msExchVersion: 4535486012416;
name: 2008-MBX1-SG1;
objectCategory: CN=ms-Exch-Storage-Group,CN=Schema,CN=Configuration,DC=exchange,DC=msft;
objectClass (3): top; container; msExchStorageGroup;
objectGUID: 7dd4c453-9052-43c6-9e18-845f8e616520;
showInAdvancedViewOnly: TRUE;
systemFlags: 0x40000000 = ( CONFIG_ALLOW_RENAME );
uSNChanged: 57771;
uSNCreated: 33269;
whenChanged: 9/17/2008 4:55:12 PM Eastern Standard Time;
whenCreated: 9/15/2008 6:11:32 PM Eastern Standard Time;
-----------
========================================
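To make the attribute value in the dump easier to read, here is a minimal Python sketch that splits one msExchStandbyCopyMachines value into its parts. The field layout is an assumption inferred from the sample above, not a documented format: the first field is taken to be the target, the second field (1) is left uninterpreted, and the two time spans are assumed to be ReplayLagTime followed by TruncationLagTime, because the sample matches the SCR defaults of a 24 hour replay lag and no truncation lag. The function names are hypothetical.

```python
from datetime import timedelta


def parse_timespan(text):
    """Parse a time span written as [d.]hh:mm:ss, e.g. '1.00:00:00'."""
    days = 0
    # A dot before the first colon separates the day count from the hours.
    if "." in text.split(":")[0]:
        day_part, text = text.split(".", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def parse_standby_entry(value):
    """Split one attribute value into named fields (layout is assumed)."""
    # LDP ends every dumped line with ';', so tolerate a trailing one.
    fields = value.rstrip(";").split(";")
    if len(fields) != 4:
        raise ValueError("expected 4 fields, got %d" % len(fields))
    target, second_field, first_lag, second_lag = fields
    return {
        "target": target,
        "second_field": second_field,                   # meaning not stated in the article
        "replay_lag": parse_timespan(first_lag),        # assumed order
        "truncation_lag": parse_timespan(second_lag),   # assumed order
    }


entry = parse_standby_entry("2008-MBX2.exchange.msft;1;1.00:00:00;00:00:00;")
print(entry["target"], entry["replay_lag"], entry["truncation_lag"])
```

If the two lag values in your environment differ from the defaults, comparing them against the values passed to enable-storagegroupcopy is a simple way to confirm or correct the assumed field order.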
The enable-storagegroupcopy command does not interact directly with the replication service to start a new replication instance. Instead, each replication service runs an internal configuration update process. When this process runs, the replication service reads Active Directory to determine which database instances need to be replicated. It generates a list and compares it to the instances it is already running. When a new replication instance is found, the replication service spawns it. When a running instance is no longer configured for replication, the replication service destroys it.
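The compare-and-adjust step described above can be pictured as a simple set comparison. The following Python sketch only illustrates that idea; it is not the replication service's actual code, and the function name and storage group names are hypothetical.

```python
def reconcile(configured, running):
    """Compare what the directory says should run with what is running.

    configured: set of instance names read from the directory
    running:    set of instance names the service currently hosts
    Returns (to_start, to_stop) as sorted lists.
    """
    to_start = sorted(configured - running)  # configured but not yet running
    to_stop = sorted(running - configured)   # running but no longer configured
    return to_start, to_stop


# Hypothetical pass: SG2 was just enabled, SG3 was just disabled.
to_start, to_stop = reconcile(
    configured={"2008-MBX1-SG1", "2008-MBX1-SG2"},
    running={"2008-MBX1-SG1", "2008-MBX1-SG3"},
)
print("start:", to_start)
print("stop:", to_stop)
```

The point of the sketch is that nothing happens at the moment the command is run. A change only takes effect on the next pass, and only if the directory copy being read already contains it.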
For standby continuous replication, the configuration update process runs every 30 seconds on the source and every 3 minutes on the target.
On the source machine, when the configuration update process runs and determines that a database on the source has been enabled for SCR, the replication service creates the file shares that the target needs to access the source and replicate logs.
On the target machine, when the configuration update process runs and determines that a database is enabled for SCR and replicated to that machine, the instance is added to the replication service and the replication service begins copying logs.
This is where customers start to experience inconsistent results. Standby continuous replication depends on reading an Active Directory attribute, so it also depends on the time that attribute takes to replicate to a domain controller in the source site and a domain controller in the target site. Until the attribute has replicated to both locations and the configuration update process has run in both locations, standby continuous replication will not be fully enabled for the storage group.
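A rough worst-case estimate of that wait can be worked out from the two polling intervals and the Active Directory replication delay. The Python sketch below is a simplified model under stated assumptions: the command writes to a domain controller in the site where it is run, the replication delay is treated as an upper bound, and the time needed to actually copy the first logs is ignored. The function and key names are hypothetical.

```python
SOURCE_POLL_SECONDS = 30    # configuration update interval on the source
TARGET_POLL_SECONDS = 180   # configuration update interval on the target


def worst_case_ready(run_in, ad_delay_seconds):
    """Estimate the worst-case seconds until each side has acted on the change.

    run_in: 'source' or 'target', the site where the command was run
    ad_delay_seconds: assumed upper bound for replication between the sites
    """
    if run_in not in ("source", "target"):
        raise ValueError("run_in must be 'source' or 'target'")
    # The attribute is visible immediately in the site where it was written
    # and only after the replication delay in the other site (assumption).
    source_visible = 0 if run_in == "source" else ad_delay_seconds
    target_visible = 0 if run_in == "target" else ad_delay_seconds
    source_ready = source_visible + SOURCE_POLL_SECONDS
    target_ready = target_visible + TARGET_POLL_SECONDS
    return {
        "source_ready": source_ready,
        "target_ready": target_ready,
        "fully_enabled": max(source_ready, target_ready),
    }


on_target = worst_case_ready("target", ad_delay_seconds=15 * 60)
on_source = worst_case_ready("source", ad_delay_seconds=15 * 60)
print(on_target)
print(on_source)
```

Under this model, with a 15 minute replication delay, the worst case is 930 seconds when the command is run in the target site and 1080 seconds when it is run in the source site. In both cases the replication delay dominates, not the polling intervals.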
I have found three scenarios that show these inconsistent results.
Example #1:
In this example we have a standalone mailbox server in SiteA. The SCR target for this standalone mailbox server is in SiteB. SiteA and SiteB are separate Active Directory sites with a 15-minute replication delay between them. On the SCR target machine the administrator runs enable-storagegroupcopy -standbymachine, and the command completes successfully. After a few minutes the get-storagegroupcopystatus -standbymachine command is run, and all replicated storage groups appear in a FAILED state.
Name           SummaryCopyStatus  CopyQueueLength  ReplayQueueLength  LastInspectedLogTime
----           -----------------  ---------------  -----------------  --------------------
2008-MBX1-SG1  Failed             0                0
2008-MBX1-SG2  Failed             0                0
A review of the source shows that the shares necessary to replicate log files were not created. At this time the admin steps away for 30 minutes and comes back to check replication again with get-storagegroupcopystatus -standbymachine. It is noted that all storage groups appear healthy, and that the shares necessary to copy logs exist on the source.
Name           SummaryCopyStatus  CopyQueueLength  ReplayQueueLength  LastInspectedLogTime
----           -----------------  ---------------  -----------------  --------------------
2008-MBX1-SG1  Healthy            5                25
2008-MBX1-SG2  Healthy            1                50
The behavior here is by design. When the administrator enabled SCR on the target machine, the command stamped the msExchStandbyCopyMachines attribute on a domain controller in SiteB. Within a 3-minute window the replication service on the target machine runs the configuration update process, detects the new replication instance, and starts attempting to copy logs. However, the attribute has not yet replicated to a domain controller in SiteA, so the replication service in SiteA does not know to create the shares needed to service replication. This results in the replication instances being marked FAILED. After 30 minutes, Active Directory replication has had time to occur, the configuration update process on the source has run and detected the new replication instance, and the shares have been created. At this point the replication service on the target can access the logs on the source, and the replication instances are marked HEALTHY. (Note that the same example applies to a single copy cluster (SCC) source.)
Example #2:
In this example we have a standalone mailbox server in SiteA. The SCR target for this standalone mailbox server is in SiteB. SiteA and SiteB are separate Active Directory sites with a 15-minute replication delay between them. On the SCR source machine the administrator runs enable-storagegroupcopy -standbymachine, and the command completes successfully. After a few minutes the get-storagegroupcopystatus -standbymachine command is run, and all replicated storage groups appear in a NOTCONFIGURED state.
Name           SummaryCopyStatus  CopyQueueLength  ReplayQueueLength  LastInspectedLogTime
----           -----------------  ---------------  -----------------  --------------------
2008-MBX1-SG1  NotConfigured      0                0
2008-MBX1-SG2  NotConfigured      0                0
A review of the source shows that the shares necessary to replicate log files are created. At this time the admin steps away for 30 minutes and comes back to check replication again with get-storagegroupcopystatus -standbymachine. It is noted that all storage groups appear healthy.
Name           SummaryCopyStatus  CopyQueueLength  ReplayQueueLength  LastInspectedLogTime
----           -----------------  ---------------  -----------------  --------------------
2008-MBX1-SG1  Healthy            5                25
2008-MBX1-SG2  Healthy            1                50
The behavior here is by design. When the administrator enabled SCR on the source machine, the command stamped the msExchStandbyCopyMachines attribute on a domain controller in SiteA. Within a 30-second window the replication service on the source machine runs the configuration update process, detects the new replication instance, and creates the shares on the source. However, the attribute has not yet replicated to a domain controller in SiteB, so the replication service in SiteB is not aware of the replication instances and responds NotConfigured when queried for status. After 30 minutes, Active Directory replication has had time to occur, the configuration update process on the target has run and detected the new replication instances, and the replication process has started. At this point the replication service is aware of the instances and responds with a healthy status when queried. (Note that the same example applies to a single copy cluster (SCC) source.)
Example #3:
In this example we have a cluster continuous replication (CCR) source in SiteA. The SCR target for this CCR source is located in SiteB. SiteA and SiteB are separate Active Directory sites with a 15-minute replication delay between them. On the SCR target machine, the administrator runs enable-storagegroupcopy -standbymachine, and the command completes successfully. After a few minutes the get-storagegroupcopystatus -standbymachine command is run, and all replicated storage groups appear HEALTHY.
Name           SummaryCopyStatus  CopyQueueLength  ReplayQueueLength  LastInspectedLogTime
----           -----------------  ---------------  -----------------  --------------------
2008-MBX1-SG1  Healthy            5                25
2008-MBX1-SG2  Healthy            1                50
This differs from Example #1 because the source is a CCR cluster. For a CCR cluster to replicate log files between its two nodes, the shares must already exist. Since the shares are already in place, we only have to wait for the replication service configuration update process to run on the target machine. AD replication is not a factor here when a domain controller in the target site is used for the enable-storagegroupcopy command.
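The three examples can be condensed into a small lookup of the status an administrator should expect to see before Active Directory replication has caught up. The Python sketch below is only a summary of the reasoning above, with hypothetical names. It assumes the status is checked after the target's configuration update process has run. The case of a CCR source with the command run in the source site is not covered by the examples, so it is assumed to follow the same reasoning as Example #2.

```python
def interim_status(run_in, source_type):
    """Expected status before the attribute has replicated to the other site.

    run_in:      'source' or 'target', the site where the command was run
    source_type: 'standalone', 'scc' or 'ccr'
    """
    if run_in not in ("source", "target"):
        raise ValueError("run_in must be 'source' or 'target'")
    if source_type not in ("standalone", "scc", "ccr"):
        raise ValueError("unknown source_type")
    if run_in == "source":
        # Example #2: the target's replication service has not yet seen the
        # instance. For a CCR source this is an assumption, not a tested case.
        return "NotConfigured"
    if source_type == "ccr":
        # Example #3: the log shares already exist for CCR replication.
        return "Healthy"
    # Example #1: the target tries to copy logs before the shares exist.
    return "Failed"


for site, kind in [("target", "standalone"), ("source", "standalone"), ("target", "ccr")]:
    print(site, kind, interim_status(site, kind))
```

In every case the expected final state is Healthy once the attribute has replicated and both configuration update processes have run.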
This information should help administrators explain some of the results they see when enabling SCR and decide where the enable command should be run.