Why not stretch CCR nodes across 2 Data Centres?

I have had a number of conversations with customers about the merits of achieving both data centre resilience and cost savings by stretching CCR across two data centres, rather than deploying both CCR and SCR, i.e. 3 for the price of 2.

Firstly, I wouldn’t consider stretching CCR unless you are confident of both the latency of the underlying network and the reliability of the link. However, there are plenty of enterprise customers with multiple gigabit data centre interconnects for whom stretching CCR seems a viable alternative to deploying both CCR and SCR.

So what are the pros and cons of doing so?

| Pro | Con | Explanation |
| --- | --- | --- |
| Less expensive to deploy and manage | | Only 2 sets of servers and storage are required, as opposed to 3. |
| Fewer servers to manage | Complex to manage | Whilst there are fewer servers to manage, the solution must have a ‘normally’ active node in order to make the most efficient use of the data centre interconnect and to ensure backups of the passive node. This is often a change to the way the solution is managed. |
| | Manual configuration may be required to control message routing within a data centre | Exchange Server 2007 uses AD site based routing. Each mailbox role server will use any hub transport server in the same site, regardless of data centre location. |
| | Difficult to control client access within a data centre | Exchange Server 2007 provides client access (OWA, OA etc.) based on AD site membership. MAPI traffic between CAS and mailbox (MBX) servers may cross the data centre interconnect. (Outlook connects directly to the mailbox role for most operations.) |
| | Querying of Active Directory may take place across the data centre interconnect | Exchange Server 2007 makes no distinction between GCs in the same AD site. AD queries will take place across the data centre interconnect, which can lead to delays in message delivery, for example. |
| | Outlook clients will experience a delay following failover (for both managed and unmanaged failover) * | In this configuration, there is one Network Name resource and two IP addresses on which the Network Name is dependent. In DNS, the network name is associated with the current online IP address. During failover, as the Network Name resource comes online, the Cluster service updates the DNS entry for the Network Name with the second IP address, which corresponds to the other subnet. The record update has to propagate throughout DNS. From Outlook’s perspective, Outlook does not need a new or reconfigured profile, but it does need to wait for its local DNS cache to flush to allow the Network Name to resolve to the other IP address. |
| | More stringent requirements of the network ** | CCR is designed to be deployed within a data centre. To avoid database copies becoming out of sync, and potential data losses increasing, there are more specific requirements in terms of network latency and bandwidth. |
| | Less resilient | 2 copies of the data with CCR->CCR, as opposed to 3 with CCR->CCR->SCR. |
| | Not recommended by Microsoft *** | Although this is a supported solution, it is not recommended by Microsoft. |

* I have made the assumption here that the solution will be deployed on Windows 2008 and that it is not possible to stretch a subnet between the data centres. In that case the two nodes must be in different subnets. Following cluster failover, the change in the IP address means that the client must wait for its DNS entry to be updated before it can connect to the CMS. To quote TechNet (Installing Cluster Continuous Replication on Windows Server 2008):

“…the name of the CMS does not change, but the IP address assigned to the CMS changes. Clients and other servers that communicate with a CMS that has changed IP addresses will not be able to re-establish communications with the CMS until DNS has been updated with the new IP address, and any local DNS caches have been updated. To minimize the amount of time it takes to have the DNS changes known to clients and other servers, we recommend setting a DNS TTL value of five minutes for the CMS Network Name resource.”

If you are able to stretch a subnet then this disadvantage disappears.
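
To get a rough feel for the size of that delay, here is a minimal Python sketch of the worst case. It assumes a very simple model in which a client caches the old address just before the updated record reaches its DNS server. The function name and the two extra delay parameters are my own illustration, not anything provided by Exchange or Windows; only the five minute TTL comes from the TechNet guidance quoted above.

```python
# Hypothetical helper for illustration only; not part of Exchange or Windows.
RECOMMENDED_TTL_S = 5 * 60  # five minute TTL from the TechNet guidance


def worst_case_reconnect_seconds(dns_ttl_s, dns_propagation_s=0, name_online_s=0):
    """Upper bound (in seconds) on how long a client may resolve the old IP.

    Assumed model: the Network Name takes name_online_s to come online and
    update DNS, the change takes dns_propagation_s to reach the client's DNS
    server, and the client may hold a cached record for a full dns_ttl_s.
    """
    if min(dns_ttl_s, dns_propagation_s, name_online_s) < 0:
        raise ValueError("durations must not be negative")
    return name_online_s + dns_propagation_s + dns_ttl_s


if __name__ == "__main__":
    # The 60 s and 30 s figures are made-up inputs, not measurements.
    delay = worst_case_reconnect_seconds(
        RECOMMENDED_TTL_S, dns_propagation_s=60, name_online_s=30
    )
    print(f"Worst-case client delay: {delay} seconds")
```

With the recommended TTL the cache term is normally the largest contributor in this model.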

** As CCR makes use of asynchronous replication, the requirements of the network are not difficult to meet. However, by stretching CCR you need to be more confident of the reliability and performance of the underlying network.
*** See Site Resilience Configurations
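
On the network footnote (**), a little arithmetic helps when judging whether an interconnect can keep up with log shipping. The Python sketch below is a rough estimate only. It relies on the fact that Exchange Server 2007 transaction log files are 1 MB each, and it ignores protocol overhead, latency and any other traffic on the link. The function names and the example figures are hypothetical.

```python
# Rough sizing sketch; function names and inputs are illustrative assumptions.
LOG_FILE_MB = 1  # Exchange Server 2007 transaction log files are 1 MB each


def required_mbit_per_s(logs_per_hour):
    """Average link throughput needed to ship logs as fast as they are created."""
    return logs_per_hour * LOG_FILE_MB * 8 / 3600


def backlog_logs(logs_per_hour, link_mbit_per_s, hours):
    """Logs left unshipped after `hours` if the link cannot keep up.

    Every log in the backlog is data that exists only in the active
    data centre, so a growing backlog means growing potential data loss.
    """
    shippable_per_hour = link_mbit_per_s * 3600 / (8 * LOG_FILE_MB)
    return max(0.0, (logs_per_hour - shippable_per_hour) * hours)


if __name__ == "__main__":
    # 9000 logs per hour and a 10 Mbit/s share of the link are made-up inputs.
    print(required_mbit_per_s(9000))
    print(backlog_logs(9000, 10, hours=1))
```

If the backlog is ever non-zero for sustained periods, the passive copy falls behind, which is exactly the situation the stricter network requirements are there to avoid.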

To finish off this discussion, there is also the issue of failover and failback following the loss of a data centre.

It often appears that a stretched cluster makes it easier to fail over and fail back following the loss and subsequent rebuild of a primary data centre. It is important to remember, however, that the 2 node cluster is a majority node set cluster and as such uses a File Share Witness (FSW) to maintain quorum. In effect, therefore, it is a 3 node cluster with 2 of the 3 nodes in the primary data centre. This means that in the event of the loss of the primary data centre there are a series of steps to follow to ‘force’ the passive node online when it cannot contact the FSW. (Placing the FSW in the secondary data centre in the first place is not recommended, because if you lose the link between the data centres you lose access to the FSW, which is deemed more likely to occur than complete data centre failure.)

Failback to the primary data centre must be very carefully managed, since for a time there may be 2 FSWs and 2 instances of the same set of databases. The likelihood is that the formerly active node will need to be rebuilt and/or the databases reseeded. This process can be more time consuming and difficult to manage than with CCR->CCR->SCR, where failback steps are better documented and understood.
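
The quorum arithmetic described above can be sketched in a few lines of Python. This is a toy model of majority voting only, not the Cluster service's actual algorithm; the `Voter` class, the `has_quorum` function and the site names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Voter:
    """One quorum vote: a cluster node or the File Share Witness."""
    name: str
    site: str


def has_quorum(voters, surviving_sites):
    """True if the voters in the surviving sites form a strict majority."""
    reachable = sum(1 for v in voters if v.site in surviving_sites)
    return reachable > len(voters) // 2


# Recommended layout: active node and FSW in the primary data centre.
fsw_in_primary = [
    Voter("active node", "primary"),
    Voter("FSW", "primary"),
    Voter("passive node", "secondary"),
]

# Layout that is not recommended: FSW alongside the passive node.
fsw_in_secondary = [
    Voter("active node", "primary"),
    Voter("FSW", "secondary"),
    Voter("passive node", "secondary"),
]

if __name__ == "__main__":
    # Primary data centre lost: the passive node holds 1 of 3 votes.
    print(has_quorum(fsw_in_primary, {"secondary"}))
    # Link lost, as seen from the primary side, for each FSW placement.
    print(has_quorum(fsw_in_primary, {"primary"}))
    print(has_quorum(fsw_in_secondary, {"primary"}))
```

In this model the passive node on its own never has a majority when the FSW sits in the primary data centre, which is why it has to be ‘forced’ online, and moving the FSW to the secondary data centre simply shifts the problem to the active node whenever the link drops.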

So in my opinion you should deploy a combination of CCR and SCR where possible. However, if you are confident that you understand all the issues related to stretching CCR and that you can manage the solution successfully, then I believe it is a viable option.

Comments

  • Anonymous
    February 23, 2009
    SCR is unlikely to be any help with a logical type of corruption since you are unlikely to know what caused the corruption in terms of a transaction within the log stream.  Recovery mechanism would likely be to move mailboxes to a new database, perhaps in combination with a restore (as well as fix the root cause of the problem).  CCR does protect you from physical corruption since if a database is damaged you can failover to the second node (or restore if you don't want to fail the entire server over [again once you've fixed the root cause]).  If it is a transaction log that is damaged the replication service will not replay that log into the second database.  You wouldn't invoke SCR to recover from a database corruption in my opinion.