Exchange 2010 / 2013: What constitutes a failure of the replication network…

In both Exchange 2010 and Exchange 2013, customers can deploy one or more replication networks in a database availability group (DAG).  There can be many reasons for using a replication network, but in most cases they are used to provide a dedicated log shipping channel between members of the same DAG.

 

As documented on TechNet, when a replication network fails, replication should automatically fail over to the DAG’s MAPI network:

· https://technet.microsoft.com/en-us/library/dd638104(v=exchg.150).aspx (Exchange 2013)

· https://technet.microsoft.com/en-us/library/dd638104(v=exchg.141).aspx (Exchange 2010)

 

In the event of a failure affecting the Replication network, if the MAPI network is unaffected by the failure, log shipping and seeding operations will revert to using the MAPI network, even if the MAPI network has its ReplicationEnabled property set to False.

 

Log shipping takes place over a connection to the Microsoft Exchange Replication service on the server that hosts the active database copy, on TCP port 64327, using a random ephemeral port on the passive server.  Log files are then pushed from the active server to the passive server over this channel.  Several issues can interrupt this log shipping channel.  For example, a firewall may block port 64327, static routes may be missing on multi-subnet replication networks, or an intermediary network device may not route traffic correctly.
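A quick way to rule out the first of these causes is to check whether TCP port 64327 on the active server can be reached from a passive server. The following is a minimal sketch and not part of Exchange: Test-TcpPort is a hypothetical helper name, the 3-second timeout is an arbitrary choice, and the function relies only on the .NET TcpClient class. It confirms that a TCP connection can be opened and nothing more; it does not show whether the Replication service would accept a log shipping session.

```powershell
# Hypothetical helper (not an Exchange cmdlet): returns $true if a TCP
# connection to the given host and port can be opened within the timeout.
function Test-TcpPort {
    param(
        [Parameter(Mandatory = $true)] [string] $ComputerName,
        [int] $Port = 64327,        # log shipping port named in this post
        [int] $TimeoutMs = 3000     # arbitrary timeout chosen for this sketch
    )
    $client = New-Object System.Net.Sockets.TcpClient
    try {
        $async = $client.BeginConnect($ComputerName, $Port, $null, $null)
        if (-not $async.AsyncWaitHandle.WaitOne($TimeoutMs)) {
            return $false           # no answer before the timeout expired
        }
        $client.EndConnect($async)  # throws if the connection was refused
        return $true
    }
    catch {
        return $false
    }
    finally {
        $client.Close()
    }
}
```

Run from a passive server, `Test-TcpPort -ComputerName 10.0.0.1` would target the Replication network address of MBX-1 in the example used later in this post. Targeting the replication IP address rather than the server name matters, because the server name typically resolves to the MAPI address.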

 

In the following example, we have a 4-member DAG running Exchange 2013.  There is a single active database that is replicated to three other servers.

 


 

The DAG has two networks, each with two subnets.  These networks represent the MAPI and Replication networks for the DAG.  Automatic network detection in Exchange 2013 was disabled for this example.

 

RunspaceId : 60e6ae0f-e69d-4ae6-9fcb-8c99ea9fd21f
Name : MapiDagNetwork
Description :
Subnets : {{192.168.0.0/24,Up}, {192.168.1.0/24,Up}}
Interfaces : {{MBX-1,Up,192.168.0.11}, {MBX-2,Up,192.168.0.12}, {MBX-3,Up,192.168.1.11},
{MBX-4,Up,192.168.1.12}}
MapiAccessEnabled : True
ReplicationEnabled : True
IgnoreNetwork : False
Identity : DAG\MapiDagNetwork
IsValid : True
ObjectState : New

RunspaceId : 60e6ae0f-e69d-4ae6-9fcb-8c99ea9fd21f
Name : ReplicationDagNetwork01
Description :
Subnets : {{10.0.1.0/24,Up}, {10.0.0.0/24,Up}}
Interfaces : {{MBX-1,Up,10.0.0.1}, {MBX-2,Up,10.0.0.2}, {MBX-3,Up,10.0.1.1}, {MBX-4,Up,10.0.1.2}}
MapiAccessEnabled : False
ReplicationEnabled : True
IgnoreNetwork : False
Identity : DAG\ReplicationDagNetwork01
IsValid : True
ObjectState : New

 

Using Get-MailboxDatabaseCopyStatus with the -ConnectionStatus switch, we can verify that the Replication network is currently in use:

 

[PS] C:\>Get-MailboxDatabaseCopyStatus * -connectionStatus | fl name,incominglogcopyingnetwork,outgoingconnections

Name : DAG-DB0\MBX-1
IncomingLogCopyingNetwork :
OutgoingConnections : {}

Name : DAG-DB0\MBX-2
IncomingLogCopyingNetwork : {MBX-1,ReplicationDagNetwork01}
OutgoingConnections :

Name : DAG-DB0\MBX-3
IncomingLogCopyingNetwork : {MBX-1,ReplicationDagNetwork01}
OutgoingConnections :

Name : DAG-DB0\MBX-4
IncomingLogCopyingNetwork : {MBX-1,ReplicationDagNetwork01}
OutgoingConnections :

 

The router servicing the network link between the two subnets of the Replication network is then shut down.  This prevents replication from succeeding over the Replication network.  What happens to the database copies?

 

[PS] C:\>Get-MailboxDatabaseCopyStatus * | fl name,status

Name : DAG-DB0\MBX-1
Status : Mounted

Name : DAG-DB0\MBX-2
Status : Healthy

Name : DAG-DB0\MBX-3
Status : DisconnectedAndHealthy

Name : DAG-DB0\MBX-4
Status : DisconnectedAndHealthy

 

In this example, the database copies enter a DisconnectedAndHealthy state.  Reviewing the connection status, we note that the connection has timed out between the servers; this is expected since the route is down.

 

[PS] C:\>Get-MailboxDatabaseCopyStatus * -ConnectionStatus | fl name,incominglogcopyingnetwork,outgoingconnections

Name : DAG-DB0\MBX-1
IncomingLogCopyingNetwork :
OutgoingConnections : {{MBX-2,ReplicationDagNetwork01}}

Name : DAG-DB0\MBX-2
IncomingLogCopyingNetwork : {MBX-1,ReplicationDagNetwork01}
OutgoingConnections :

Name : DAG-DB0\MBX-3
IncomingLogCopyingNetwork : {MBX-1,,A timeout occurred while communicating with server 'MBX-1'. Error: "A connection
could not be completed within 15 seconds."}
OutgoingConnections :

Name : DAG-DB0\MBX-4
IncomingLogCopyingNetwork : {MBX-1,,A timeout occurred while communicating with server 'MBX-1'. Error: "A connection
could not be completed within 15 seconds."}
OutgoingConnections :
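When a DAG hosts many databases, it can help to filter the status output down to the copies that have lost their connection to the active copy. A minimal sketch follows. Select-DisconnectedCopy is a hypothetical helper name, and the function assumes only that each input object exposes a Status property, as the Get-MailboxDatabaseCopyStatus output above does. It matches any status that begins with "Disconnected", such as DisconnectedAndHealthy.

```powershell
# Hypothetical helper: passes through only the copy status objects whose
# Status begins with "Disconnected" (for example DisconnectedAndHealthy).
function Select-DisconnectedCopy {
    param(
        [Parameter(Mandatory = $true, ValueFromPipeline = $true)]
        $CopyStatus
    )
    process {
        # Convert to a string first so the comparison works whether Status
        # arrives as an enumeration value or as plain text.
        if ("$($CopyStatus.Status)" -like 'Disconnected*') {
            $CopyStatus
        }
    }
}

# Intended usage in the Exchange Management Shell:
# Get-MailboxDatabaseCopyStatus * | Select-DisconnectedCopy | fl name,status
```

In the example above, this would return the copies on MBX-3 and MBX-4 and skip the copies on MBX-1 and MBX-2.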

 

Why didn’t replication fail over to the MAPI network in this case?  The ability to establish a log shipping channel is not one of the criteria that the Replication service uses to determine the health of a given network.  Instead, the Replication service relies on feedback from the Cluster service regarding individual network interface status to determine the health of a log shipping channel.

 

For each subnet that exists on a DAG member, an associated cluster network is created.  In this example, there are 4 subnets and therefore there are 4 cluster networks.

 

[PS] C:\>Get-ClusterNetwork | fl

Name : Cluster Network 1
State : Up

Name : Cluster Network 2
State : Up

Name : Cluster Network 3
State : Up

Name : Cluster Network 4
State : Up
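The count of four follows directly from the interface addresses. As a rough illustration, the sketch below masks each interface address from the DAG network output shown earlier with its /24 prefix and lists the distinct subnets; each distinct subnet corresponds to one cluster network. Get-SubnetId is a hypothetical helper written for this example, not an Exchange or failover clustering cmdlet, and it handles IPv4 addresses only.

```powershell
# Hypothetical helper: returns the IPv4 subnet (network address/prefix)
# that a given address belongs to.
function Get-SubnetId {
    param(
        [Parameter(Mandatory = $true)] [string] $IPAddress,
        [Parameter(Mandatory = $true)] [int]    $PrefixLength
    )
    $bytes = [System.Net.IPAddress]::Parse($IPAddress).GetAddressBytes()
    $network = for ($i = 0; $i -lt 4; $i++) {
        # Number of mask bits that fall inside this octet (0 to 8).
        $bits = [Math]::Max(0, [Math]::Min(8, $PrefixLength - (8 * $i)))
        $mask = 256 - [int][Math]::Pow(2, 8 - $bits)
        $bytes[$i] -band $mask
    }
    '{0}/{1}' -f ($network -join '.'), $PrefixLength
}

# Interface addresses taken from the two DAG networks in this example.
$addresses = '192.168.0.11', '192.168.0.12', '192.168.1.11', '192.168.1.12',
             '10.0.0.1', '10.0.0.2', '10.0.1.1', '10.0.1.2'

$subnets = $addresses |
    ForEach-Object { Get-SubnetId -IPAddress $_ -PrefixLength 24 } |
    Sort-Object -Unique

$subnets          # the distinct subnets
$subnets.Count    # 4, one per cluster network
```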

 


 

The interface associated with each of these subnets is included in the appropriate cluster network.  Each of these interfaces has a status based on status reporting in Windows Failover Clustering.

 

[PS] C:\>Get-ClusterNetworkInterface | fl

Name : MBX-1 - LAN-A
Node : MBX-1
Network : Cluster Network 1
State : Up

Name : MBX-2 - LAN-A
Node : MBX-2
Network : Cluster Network 1
State : Up

Name : MBX-1 - REPL-A
Node : MBX-1
Network : Cluster Network 2
State : Up

Name : MBX-2 - REPL-A
Node : MBX-2
Network : Cluster Network 2
State : Up

Name : MBX-3 - REPL-B
Node : MBX-3
Network : Cluster Network 3
State : Up

Name : MBX-4 - REPL-B
Node : MBX-4
Network : Cluster Network 3
State : Up

Name : MBX-3 - LAN-B
Node : MBX-3
Network : Cluster Network 4
State : Up

Name : MBX-4 - LAN-B
Node : MBX-4
Network : Cluster Network 4
State : Up

 


 

In this scenario, the Cluster service considers all of the interfaces to be “Up.”  Since the interfaces are Up, the Replication service does not fail over to the MAPI network, even though replication cannot occur over the Replication network.

 

If the Cluster service reports an interface as “Failed,” the behavior is different.  On MBX-4, which hosts a passive database copy, the network cable was removed from the Replication network interface, causing the cluster to mark that interface as failed:

 

[PS] C:\>Get-ClusterNetworkInterface | fl

Name : MBX-1 - LAN-A
Node : MBX-1
Network : Cluster Network 1
State : Up

Name : MBX-2 - LAN-A
Node : MBX-2
Network : Cluster Network 1
State : Up

Name : MBX-1 - REPL-A
Node : MBX-1
Network : Cluster Network 2
State : Up

Name : MBX-2 - REPL-A
Node : MBX-2
Network : Cluster Network 2
State : Up

Name : MBX-3 - REPL-B
Node : MBX-3
Network : Cluster Network 3
State : Up

Name : MBX-4 - REPL-B
Node : MBX-4
Network : Cluster Network 3
State : Failed

Name : MBX-3 - LAN-B
Node : MBX-3
Network : Cluster Network 4
State : Up

Name : MBX-4 - LAN-B
Node : MBX-4
Network : Cluster Network 4
State : Up


 

The copy status for the databases hosted on MBX-4 is healthy.

 

[PS] C:\>Get-MailboxDatabaseCopyStatus *

Name            Status    CopyQueueLength  ReplayQueueLength  LastInspectedLogTime   ContentIndexState
----            ------    ---------------  -----------------  --------------------   -----------------
DAG-DB0\MBX-1   Mounted   0                0                                         Healthy
DAG-DB0\MBX-2   Healthy   0                0                  4/29/2014 2:11:47 PM   Healthy
DAG-DB0\MBX-3   Healthy   0                0                  4/29/2014 2:11:47 PM   Healthy
DAG-DB0\MBX-4   Healthy   0                0                  4/29/2014 2:11:47 PM   Healthy

 

When reviewing the connection status for the databases on MBX-4, we see that the Replication service is using the MAPI network for log shipping.  Servers that have no issues with the Replication interface continue to use that interface.

 

[PS] C:\>Get-MailboxDatabaseCopyStatus * -ConnectionStatus | fl name,incominglogcopyingnetwork,outgoingconnections

Name : DAG-DB0\MBX-1
IncomingLogCopyingNetwork :
OutgoingConnections : {}

Name : DAG-DB0\MBX-2
IncomingLogCopyingNetwork : {MBX-1,ReplicationDagNetwork01}
OutgoingConnections :

Name : DAG-DB0\MBX-3
IncomingLogCopyingNetwork : {MBX-1,ReplicationDagNetwork01}
OutgoingConnections :

Name : DAG-DB0\MBX-4
IncomingLogCopyingNetwork : {MBX-1,MapiDagNetwork}
OutgoingConnections :

 

When the interface was marked as failed, the Replication service successfully failed over to the MAPI network.

 

But what happens if the interface is not Failed, but the interface configuration is invalid?  In this case, the Cluster service cannot pass cluster traffic between the interfaces on the affected network, and the associated cluster network is marked as partitioned.

 

[PS] C:\>Get-ClusterNetwork | fl

Name : Cluster Network 1
State : Up

Name : Cluster Network 2
State : Up

Name : Cluster Network 3
State : Partitioned

Name : Cluster Network 4
State : Up


 

The interfaces within the partitioned network are marked as unreachable.

 

[PS] C:\>Get-ClusterNetworkInterface | fl

Name : MBX-1 - LAN-A
Node : MBX-1
Network : Cluster Network 1
State : Up

Name : MBX-2 - LAN-A
Node : MBX-2
Network : Cluster Network 1
State : Up

Name : MBX-1 - REPL-A
Node : MBX-1
Network : Cluster Network 2
State : Up

Name : MBX-2 - REPL-A
Node : MBX-2
Network : Cluster Network 2
State : Up

Name : MBX-3 - REPL-B
Node : MBX-3
Network : Cluster Network 3
State : Unreachable

Name : MBX-4 - REPL-B
Node : MBX-4
Network : Cluster Network 3
State : Unreachable

Name : MBX-3 - LAN-B
Node : MBX-3
Network : Cluster Network 4
State : Up

Name : MBX-4 - LAN-B
Node : MBX-4
Network : Cluster Network 4
State : Up

 


 

What happens to the copies that are hosted on the servers with an unreachable interface status?

 

[PS] C:\>Get-MailboxDatabaseCopyStatus * | fl name,status

Name : DAG-DB0\MBX-1
Status : Mounted

Name : DAG-DB0\MBX-2
Status : Healthy

Name : DAG-DB0\MBX-3
Status : DisconnectedAndHealthy

Name : DAG-DB0\MBX-4
Status : DisconnectedAndHealthy

 

An unreachable interface is not the same as a failed interface.  This resulted in the databases entering a disconnected state and replication not failing over to the MAPI network.  The connection status confirms this failure.

 

[PS] C:\>Get-MailboxDatabaseCopyStatus * -ConnectionStatus | fl name,incominglogcopyingnetwork,outgoingconnections

Name : DAG-DB0\MBX-1
IncomingLogCopyingNetwork :
OutgoingConnections : {}

Name : DAG-DB0\MBX-2
IncomingLogCopyingNetwork : {MBX-1,ReplicationDagNetwork01}
OutgoingConnections :

Name : DAG-DB0\MBX-3
IncomingLogCopyingNetwork : {MBX-1,,A timeout occurred while communicating with server 'MBX-1'. Error: "A connection
could not be completed within 15 seconds."}
OutgoingConnections :

Name : DAG-DB0\MBX-4
IncomingLogCopyingNetwork : {MBX-1,,A timeout occurred while communicating with server 'MBX-1'. Error: "A connection
could not be completed within 15 seconds."}
OutgoingConnections :

 

The network failure detection mechanism is the same in both Exchange 2010 and Exchange 2013, and the same in Windows Server 2008 R2, Windows Server 2012, and Windows Server 2012 R2 failover clusters.  In order for the Replication service to detect a failure of a Replication network, the operating system and Cluster service must report the underlying interface as failed.  If the Cluster service reports any other status for the interface, the Replication service will consider the network to be valid and replication will not fail over to another network.
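The behavior observed in the three tests can be condensed into a small lookup. The sketch below only restates this post's findings and is not the Replication service's actual logic. Get-ExpectedReplicationFallback is a hypothetical helper name, and the function assumes input objects with Name and State properties, such as those returned by Get-ClusterNetworkInterface.

```powershell
# Hypothetical helper: for each cluster network interface, reports whether
# the tests in this post suggest log shipping would fall back to the MAPI
# network. Only a state of Failed produced a fallback; Up and Unreachable
# did not.
function Get-ExpectedReplicationFallback {
    param(
        [Parameter(Mandatory = $true, ValueFromPipeline = $true)]
        $Interface
    )
    process {
        $state = "$($Interface.State)"
        New-Object PSObject -Property @{
            Name            = $Interface.Name
            State           = $state
            FallsBackToMapi = ($state -eq 'Failed')
        }
    }
}

# Intended usage on a DAG member:
# Get-ClusterNetworkInterface | Get-ExpectedReplicationFallback
```

The result is only meaningful for interfaces that belong to a Replication network, so the output should be read for those interfaces alone (for example, the REPL-A and REPL-B interfaces in this post).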

Comments

  • Anonymous
    January 01, 2003
    @Zoltan:

    Thanks for the comment.

    TIMMCMIC
  • Anonymous
    May 06, 2014
    Thanks Tim
    always always interesting
  • Anonymous
    June 05, 2014
    Although MS must have its reasons for implementing this behaviour, to me it seems dodgy. Not being able to communicate on the replication network is a failure as far as the DAG is concerned, regardless of the state of the NIC.

    Great post, thanks Tim for clarifying it.
  • Anonymous
    July 02, 2014
    Great, finally I have the answer why my DAG replica NEVER fall back on MAPI !!!! So, it's completely useless as a detection strategy. The connection is managed by routers and the server are attached to switches and a router outage will never impact the server network card status.
    Why MS does not implement something less primitive such ad "route watch" o "remote network reachability" ?!?
    Regards.

    Red.