On-Premises to Office 365 Mail Flow Delay Troubleshooting

If you are moving to a hybrid solution with Office 365, make sure you have reviewed this configuration properly.

This applies not only to a hybrid environment with Office 365 and on-premises Exchange 2007/2003, but also to an on-premises coexistence setup with Exchange 2010 and Exchange 2007/2003. In either case you might see queues building up on the Exchange 2007/2003 side for mail being sent to the Exchange 2010 coexistence Hub servers.

If you review the message headers, you can clearly see which hops introduce the delay.

 

| Delay | From | To | Protocol | Time received |
|---|---|---|---|---|
| -48 sec | unknown | Application Server | SMTP | Wednesday, January 11, 2012 10:04:58 AM |
| -67 mins | Application Server | On premise Scan Server | ESMTP | Wednesday, January 11, 2012 8:58:08 AM |
| 6 sec | On premise Scan Server | Exchange 2007 HUB server | | Wednesday, January 11, 2012 10:08:42 AM |
| 4 mins | Exchange 2007 HUB server | Co-Existence Exchange 2010 Server | | Wednesday, January 11, 2012 10:12:53 AM |
| -51 sec | External Gateway | TX2EHSMHS015.bigfish.com | | Wednesday, January 11, 2012 10:12:02 AM |
| 2 mins | unknown | mail187-tx2.bigfish.com | ESMTP | Wednesday, January 11, 2012 10:14:14 AM |
| 5 sec | localhost | mail187-tx2-R.bigfish.com | ESMTP | Wednesday, January 11, 2012 10:14:19 AM |
| 2 sec | mail187-tx2-R.bigfish.com | HKNPRD0202HT001.apcprd02.prod.outlook.com | | Wednesday, January 11, 2012 10:14:21 AM |
| 37 sec | HKNPRD0202HT001.apcprd02.prod.outlook.com | HKNPRD0202HT002.apcprd02.prod.outlook.com | | Wednesday, January 11, 2012 10:14:58 AM |
| 1 sec | HKNPRD0202HT002.apcprd02.prod.outlook.com | SINPRD0204HT004.apcprd02.prod.outlook.com | | Wednesday, January 11, 2012 10:14:59 AM |
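The per-hop delay in a header analysis like this is simply the difference between consecutive "Time received" values. As a rough sketch (not the tool that produced the table), the following Python recomputes the delays from the Exchange 2007 Hub server hop onward, using the timestamps above. The function name `hop_delays` and the `HOPS` list are illustrative names introduced here.

```python
from datetime import datetime

# "Time received" values copied from the header analysis above, starting at
# the Exchange 2007 Hub server hop. Each entry is (receiving host, timestamp).
HOPS = [
    ("Exchange 2007 HUB server", "Wednesday, January 11, 2012 10:08:42 AM"),
    ("Co-Existence Exchange 2010 Server", "Wednesday, January 11, 2012 10:12:53 AM"),
    ("TX2EHSMHS015.bigfish.com", "Wednesday, January 11, 2012 10:12:02 AM"),
    ("mail187-tx2.bigfish.com", "Wednesday, January 11, 2012 10:14:14 AM"),
    ("mail187-tx2-R.bigfish.com", "Wednesday, January 11, 2012 10:14:19 AM"),
    ("HKNPRD0202HT001.apcprd02.prod.outlook.com", "Wednesday, January 11, 2012 10:14:21 AM"),
    ("HKNPRD0202HT002.apcprd02.prod.outlook.com", "Wednesday, January 11, 2012 10:14:58 AM"),
    ("SINPRD0204HT004.apcprd02.prod.outlook.com", "Wednesday, January 11, 2012 10:14:59 AM"),
]

# Matches strings such as "Wednesday, January 11, 2012 10:08:42 AM".
TIME_FORMAT = "%A, %B %d, %Y %I:%M:%S %p"


def hop_delays(hops):
    """Return (receiving_host, delay_in_seconds) for every hop after the first."""
    times = [(host, datetime.strptime(stamp, TIME_FORMAT)) for host, stamp in hops]
    return [
        (host, int((when - prev_when).total_seconds()))
        for (_, prev_when), (host, when) in zip(times, times[1:])
    ]


if __name__ == "__main__":
    delays = hop_delays(HOPS)
    for host, seconds in delays:
        print(f"{seconds:>5} s  -> {host}")
    slowest = max(delays, key=lambda item: item[1])
    print("Slowest hop:", slowest)
```

With these timestamps the largest gap is the 251 seconds (about 4 minutes) between the Exchange 2007 Hub server and the Exchange 2010 coexistence server. The negative value at the external gateway most likely reflects clocks that are not synchronized between servers rather than a real negative delay.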

 

Users might complain that mail from on-premises mailboxes to Office 365 is slow, and you might see the same delay for mail sent to Office 365 mailboxes from your application servers, scanners, and printers. So what's happening?

In a hybrid environment, your first troubleshooting step might be to look at the connectors between your on-premises servers and the Exchange 2010 coexistence Hub/CAS server.

  1. Monitor the message tracking log and review the message headers to see the exact delay between the on-premises sender and the Office 365 recipient.
  2. Monitor the queues regularly to determine whether the backlog is only on the connector to Office 365 or also affects internal routing and Internet mail.
  3. Monitor the network between the on-premises Hub server and the Exchange 2010 coexistence servers.
  4. Monitor the performance of the on-premises Exchange 2007 Hub server:
    1. Logman.exe create counter MSPerf-1sec -f bincirc -max 800 -c "\LogicalDisk(*)\*" "\Memory\*" "\Processor(*)\*" "\MSExchangeADAccess Domain Controllers(*)\*" "\MSExchangeTransportQueues(*)\*" "\MSExchangeTransport Resolver(*)\*" "\MSExchangeTransportRouting(*)\*" "\MSExchangeTransport SMTPReceive(*)\*" "\MSExchangeTransportSMTPSend(*)\*" -si 00:00:01 -o c:\PerfMonLogs\MSPerf-1sec.blg
  5. Enable verbose SMTP protocol logging on the connectors to gather more information:
    1. Set-SendConnector "Send Connector Name" -ProtocolLoggingLevel Verbose
    2. Set-ReceiveConnector "Receive Connector Name" -ProtocolLoggingLevel Verbose
      https://technet.microsoft.com/en-us/library/bb124531(v=EXCHG.80).aspx
  6. Review the event logs on the Exchange 2007 server for Event ID 15004 warnings, which indicate back pressure on the Exchange server.
    1. Time: 2012/2/8 5:39:23
      ID: 15004
      Level: Warning
      Source: MSExchangeTransport
      Machine: <ServerName>
      Message: Resource pressure increased from Normal to High.

      Resource utilization of the following resources exceed the normal level:
      Version buckets = 116 [High] [Normal=40 Medium=60 High=100]

      Back pressure caused the following components to be disabled:
      Inbound mail submission from Hub Transport servers
      Inbound mail submission from the Internet
      Mail submission from the Pickup directory
      Mail submission from the Replay directory
      Mail submission from Mailbox servers
      Mail delivery to remote domains

This normally halts mail flow on the Exchange server until the version buckets return to a normal level. It can have several causes:

  1. Disk latency issue.
  2. The Queue database is corrupted.
  3. CPU utilization is high.
  4. Antivirus.

Our performance logs ruled out disk latency as the cause. To troubleshoot further:

  1. Stop all the antivirus services.
  2. Disable all non-default transport agents running on the Exchange server, including the disclaimer agent and Forefront Security for Exchange. To do this:
    1. Stop and disable all the Forefront-related services in the Services console.
    2. Disable the disclaimer agent using the Disable-TransportAgent cmdlet, for example:
      Disable-TransportAgent -Identity "Disclaimer Agent"
      Disable-TransportAgent: https://technet.microsoft.com/en-us/library/aa997880(v=exchg.80).aspx
  3. Restart the Microsoft Exchange Transport service in the Services console for the change to take effect.
  4. Monitor the server to see whether the situation improves.

Flush the queue and monitor the queue status to see whether messages are still getting stuck at the connector. If that doesn't help, read on.

After reviewing the SMTP logs in verbose mode, you will eventually notice that the lag occurs only between the on-premises mailboxes and the Office 365 mailboxes. So what is causing this? It's a simple internal connector between the on-premises Exchange 2007 server and the Exchange 2010 coexistence server, which routes on to Office 365. This is simple routing, so what is the big difference here?

In a hybrid scenario, when an e-mail message is received, a Hub Transport server resolves the recipient e-mail address on the message to a recipient object. If the recipient object is an on-premises mailbox or distribution group, the message is delivered to the recipient. If the recipient object is a mail user that's associated with a mailbox in the cloud-based organization, Exchange reviews the target delivery address (e.g. @service.contoso.com) of the mail user and redirects the message to the cloud-based organization. The message is passed to the hybrid server, delivered to the cloud-based organization, and then delivered to the cloud-based mailbox. See the following figure for an example of the message flow.

We are seeing the queue and the delay in the last part of that flow, where the message is passed from the on-premises server to the hybrid server. That suggests the problem is related to some configuration on the connector between the on-premises Exchange 2007 Hub server and the Exchange 2010 coexistence server. The receive connector protocol log on the Exchange 2010 server shows it:

2012-02-15T09:44:31.537Z,EX20101\Default EX20101,08CEA5BE74A2BE8C,47,192.168.1.10:25,192.168.1.12:54116,*,08CEA5BE74A2BE8C;2012-02-15T09:44:31.427Z;1,receiving message
2012-02-15T09:44:31.537Z,EX20101\Default EX20101,08CEA5BE74A2BE8C,48,192.168.1.10:25,192.168.1.12:54116,<,RCPT TO:<eric@service.contoso.com> ORCPT=rfc822;eric@contoso.com,
2012-02-15T09:44:31.537Z,EX20101\Default EX20101,08CEA5BE74A2BE8C,49,192.168.1.10:25,192.168.1.12:54116,>,250 2.1.0 Sender OK,
2012-02-15T09:44:31.537Z,EX20101\Default EX20101,08CEA5BE74A2BE8C,50,192.168.1.10:25,192.168.1.12:54116,>,250 2.1.5 Recipient OK,
2012-02-15T09:44:31.537Z,EX20101\Default EX20101,08CEA5BE74A2BE8C,51,192.168.1.10:25,192.168.1.12:54116,<,BDAT 3246 LAST,
2012-02-15T09:45:00.944Z,EX20101\Default EX20101,08CEA5BE74A2BE8C,52,192.168.1.10:25,192.168.1.12:54116,*,Tarpit for '0.00:00:30.172' due to 'DelayedAck',Delivered
2012-02-15T09:45:00.944Z,EX20101\Default EX20101,08CEA5BE74A2BE8C,53,192.168.1.10:25,192.168.1.12:54116,>,250 2.6.0 <NEPTUN@contoso.com> [InternalId=1260208] Queued mail for delivery,
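On a busy server it is tedious to spot these entries by eye. The following Python is a minimal sketch for pulling tarpit entries out of protocol log lines like the ones above. It assumes the comma-separated layout seen in this excerpt (timestamp, connector, session ID, sequence number, local endpoint, remote endpoint, event, data, context). The `find_tarpits` function and the `SAMPLE` list are illustrative names, not part of any Exchange tooling.

```python
import csv
import re

# Matches the data field of a tarpit entry, for example:
#   Tarpit for '0.00:00:30.172' due to 'DelayedAck'
# The interval is written as days.hours:minutes:seconds.
TARPIT = re.compile(
    r"Tarpit for '(\d+)\.(\d+):(\d+):(\d+(?:\.\d+)?)' due to '([^']+)'"
)

# Three lines copied from the log excerpt above.
SAMPLE = [
    r"2012-02-15T09:44:31.537Z,EX20101\Default EX20101,08CEA5BE74A2BE8C,51,192.168.1.10:25,192.168.1.12:54116,<,BDAT 3246 LAST,",
    r"2012-02-15T09:45:00.944Z,EX20101\Default EX20101,08CEA5BE74A2BE8C,52,192.168.1.10:25,192.168.1.12:54116,*,Tarpit for '0.00:00:30.172' due to 'DelayedAck',Delivered",
    r"2012-02-15T09:45:00.944Z,EX20101\Default EX20101,08CEA5BE74A2BE8C,53,192.168.1.10:25,192.168.1.12:54116,>,250 2.6.0 <NEPTUN@contoso.com> [InternalId=1260208] Queued mail for delivery,",
]


def find_tarpits(log_lines):
    """Yield (timestamp, session_id, seconds, reason) for each tarpit entry."""
    for row in csv.reader(log_lines):
        # Skip short rows and the "#" header lines at the top of a log file.
        if len(row) < 8 or row[0].startswith("#"):
            continue
        match = TARPIT.search(row[7])
        if match:
            days, hours, minutes, seconds, reason = match.groups()
            total = (int(days) * 86400 + int(hours) * 3600
                     + int(minutes) * 60 + float(seconds))
            yield row[0], row[2], total, reason


if __name__ == "__main__":
    for timestamp, session, seconds, reason in find_tarpits(SAMPLE):
        print(f"{timestamp} session {session}: {seconds:.3f} s ({reason})")
```

For the excerpt above this reports a single entry of roughly 30 seconds with the reason DelayedAck, which points at delayed acknowledgement rather than anti-spam tarpitting.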

Tarpitting
=======
Delayed Acknowledgement is an attempt made by Exchange 2010 Transport servers to protect messages received from less sophisticated mail servers. This is accomplished by making the sending server wait while the message is delivered behind the scenes of the 2010 environment.

Tarpitting Functionality
https://technet.microsoft.com/en-us/library/bb123891.aspx

To combat directory harvest attacks, Exchange 2010 includes tarpitting functionality. Tarpitting is the practice of artificially delaying server responses for specific SMTP communication patterns that indicate high volumes of spam or other unwelcome messages.

MaxAcknowledgementDelay
======================

The MaxAcknowledgementDelay parameter specifies the maximum period the transport server delays acknowledgement until it verifies that the message has been successfully delivered to all recipients. When receiving messages from a host that doesn't support shadow redundancy, an Exchange Server 2010 transport server will delay issuing an acknowledgement until it verifies that the message has been successfully delivered to all recipients. However, if it takes too long to verify successful delivery, the transport server will time out and issue an acknowledgement anyway.

The default value is 30 seconds.
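To make the effect concrete, here is a deliberately simplified Python model of the behaviour described above. It is not Exchange code: the function name, its parameters, and the idea of a single "delivery time" are assumptions made purely for illustration.

```python
def acknowledgement_delay(delivery_seconds, max_ack_delay=30.0,
                          sender_supports_shadow=False):
    """Seconds the receiving server waits before acknowledging a message.

    Simplified model (an assumption, not the real implementation):
    - a sender that supports shadow redundancy is acknowledged at once;
    - otherwise the server waits until delivery is verified, but never
      longer than max_ack_delay;
    - a max_ack_delay of 0 turns the wait off.
    """
    if sender_supports_shadow or max_ack_delay <= 0:
        return 0.0
    return min(float(delivery_seconds), float(max_ack_delay))


if __name__ == "__main__":
    # Fast onward delivery: the sender waits only as long as delivery takes.
    print(acknowledgement_delay(2))
    # Slow onward delivery: the wait is capped at the 30-second default.
    print(acknowledgement_delay(120))
    # Delay set to 0: the sender is acknowledged immediately.
    print(acknowledgement_delay(120, max_ack_delay=0))
```

Under this model, a legacy sender whose messages take a long time to deliver onward waits the full 30 seconds for every message, which is consistent with the roughly 30-second tarpit entry in the protocol log.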

https://technet.microsoft.com/en-us/library/dd351046.aspx

https://technet.microsoft.com/en-us/library/bb125139.aspx

Shadow Redundancy Mail Flow Scenarios
 https://technet.microsoft.com/en-us/library/dd351091.aspx

Exchange Server 2010 introduces the shadow redundancy feature to provide redundancy for messages for the entire time they're in transit. The solution involves a technique similar to the transport dumpster. With shadow redundancy, the deletion of a message from the transport databases is delayed until the transport server verifies that all of the next hops for that message have completed delivery. If any of the next hops fail before reporting back successful delivery, the message is resubmitted for delivery to that next hop.

Unfortunately, neither Exchange Server 2007 transport servers nor Exchange Server 2003 bridgehead servers support shadow redundancy. Therefore, if you have a coexistence scenario with previous versions of Exchange, Exchange 2010 redundancy features can guarantee message delivery only until the legacy Exchange hop, and not all the way to its destination. The same applies to the scenario where Exchange 2010 Edge Transport servers send messages to non-Exchange mail servers. Because Exchange 2007 doesn’t support shadow redundancy, an Exchange 2010 server may delay its acknowledgement when it receives mail from an Exchange 2007 server.

We need to run the following two commands on both Exchange 2010 (NLB) servers to disable these features:

Set-ReceiveConnector "Default EX2010" -TarpitInterval 00:00:00
Set-ReceiveConnector "Default EX2010" -MaxAcknowledgementDelay 0

Restart the transport service on both Exchange 2010 servers, and then check whether messages are still queued on Exchange 2007.

 

Check the queue and you should find that the delay is gone and mail flow is much faster. You can also analyze the message headers to confirm the drop in delay.

 

| Delay | From | To | Protocol | Time received |
|---|---|---|---|---|
| | Exchange 2007 HUB Server | Exchange 2007 HUB Server | mapi | Thursday, 16 February, 2012 10:13:21 AM |
| 1 sec | Exchange 2007 HUB Server | Co-existence Exchange 2010 Server | | Thursday, 16 February, 2012 10:13:22 AM |
| -66 sec | External Gateway | TX2EHSMHS038.bigfish.com | | Thursday, 16 February, 2012 10:12:16 AM |
| 1 sec | unknown | mail88-tx2.bigfish.com | ESMTP | Thursday, 16 February, 2012 10:12:17 AM |
| 1 sec | localhost | mail88-tx2-R.bigfish.com | ESMTP | Thursday, 16 February, 2012 10:12:18 AM |
| 2 sec | mail88-tx2-R.bigfish.com | HKNPRD0202HT004.apcprd02.prod.outlook.com | | Thursday, 16 February, 2012 10:12:20 AM |
| | HKNPRD0202HT004.apcprd02.prod.outlook.com | HKNPRD0202HT003.apcprd02.prod.outlook.com | | Thursday, 16 February, 2012 10:12:20 AM |
| | HKNPRD0202HT003.apcprd02.prod.outlook.com | HKNPRD0204HT002.apcprd02.prod.outlook.com | | Thursday, 16 February, 2012 10:12:20 AM |

Make sure you restart the Exchange Transport services and flush the queue. Also monitor the event logs for warnings.

An easier alternative is to turn off the shadow redundancy feature in Exchange 2010 entirely:

Set-TransportConfig -ShadowRedundancyEnabled $false

-Ram

Comments

  • Anonymous
    January 01, 2003
    Hi mrams,

    Thanks a lot. It works.

  • Anonymous
    May 04, 2012
    Hi, I want to know strategy for Hub Transport Message Queue Site Resiliency options in the event of loss of a whole data center while there are still messages in transit on Hub transport server. How we can protect messages which are still in HT queue and they are not yet send over to MBX server (Shadow redundancy process not yet started). We need to achieve Zero Message loss. Is it possible anyway to send a copy of message which received on HT or ET level  from Site A to other HT/ET at Site B ??  I know about Shadow redundancy Promotion (SRP)….can we somehow configure ET/HT layer using SRP shadow message to remote site?? We already got 2 Active data centers with Exchange 2010 servers except HT layer all other layers are redundant and losing a messages at other layer is already covered. I would really appreciate any help. Thanks, JS

  • Anonymous
    October 30, 2013
    Hi mrams, Thanks a billion.  we had the same issue as described in the article and I ran the two commands (tarpitting and Maxacknowledgement delay) and after that the queue had dropped down very quickly from 16000 mails to 10000 emails in a matter of about 15 mins. Regards Rama