Public Folder Replication Troubleshooting – Part 3: Troubleshooting Replica Deletion and Common Problems

Original post on US blog: January 24, 2006

This is the third blog post about troubleshooting public folder replication issues. The first post covered Troubleshooting the Replication of New Changes, and the second covered Troubleshooting the Replication of Existing Data. This post covers Troubleshooting Replica Deletion and Common Problems. To get the full picture, please read all referenced material!

Troubleshooting Replica Deletion

You removed an old server from the replica list on all your folders. However, when you go to Public Folder Instances for the old store in ESM, you still see a bunch of folders there. This is due to a problem with the replica deletion process. In the Exchange 2003 SP2 version of ESM, if you try to delete a public store in this state, ESM presents a dialog stating:

“You cannot delete this public folder store because it contains folder replicas. To avoid data loss, right click the public folder store and use Move Replicas to move the replicas to a different server. It may take several hours until the content is replicated to the new server and the local replicas are removed.”

When you remove a store from the replica list on a folder, that store does not immediately delete the data. Instead, it sends out a special 0x20 status request to all the other replicas. This is called a Replica Delete Pending Status Request (RDPSR), and it cannot be distinguished from a normal status request in the application log. An RDPSR contains a flag that indicates the replica is pending deletion. When the other stores receive this 0x20, they respond with a special 0x10 called a Replica Delete Pending Ack (RDPA). The RDPA indicates that it’s ok to delete that data, but the other stores only send this 0x10 if they already have all the CNs (change numbers) that the pending deletion replica has. The replica will only be deleted once the store has received a 0x10 indicating that someone else has the data.
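The acknowledgement rule above can be modeled in a few lines. This is only a conceptual sketch in Python, not Exchange code: the names `Replica`, `handle_rdpsr`, and `can_delete` are hypothetical, and change numbers are simplified to a set of integers.

```python
from dataclasses import dataclass, field

STATUS_REQUEST = 0x20   # the RDPSR travels as a 0x20 status request
STATUS_RESPONSE = 0x10  # the RDPA travels as a 0x10

@dataclass
class Replica:
    name: str
    cns: set = field(default_factory=set)  # simplified: CNs this store holds

def handle_rdpsr(receiver, pending_cns):
    """Answer with a 0x10 (RDPA) only if the receiver already holds
    every CN that the pending deletion replica holds."""
    if pending_cns <= receiver.cns:
        return STATUS_RESPONSE
    return None  # no RDPA: the receiver is still missing data

def can_delete(pending, others):
    """The pending replica may delete once any other store sends a 0x10."""
    return any(handle_rdpsr(o, pending.cns) == STATUS_RESPONSE for o in others)

old = Replica("OLD", {1, 2, 3})
a = Replica("A", {1, 2})         # missing CN 3
b = Replica("B", {1, 2, 3, 4})   # has everything OLD has
print(can_delete(old, [a]))      # False
print(can_delete(old, [a, b]))   # True
```

In this model, a single replica holding all of the data is enough to release the deletion, which matches the rule that one 0x10 from anyone suffices.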

This means that if you delete the store before Public Folder Instances is empty, you are probably losing data. Only the 2003 SP2 ESM will stop you from doing this; in older versions you must manually check Public Folder Instances to see if it’s ok to delete the store. You should always check Public Folder Instances before deleting a public store. When the 2003 SP2 ESM gives you this warning, do not try to ignore it or work around it. Instead, troubleshoot the replica deletion process.

Note that prior to Exchange 2003 SP2, the server that was removed from the replica list only sends the RDPSR once. If no one responds, the folder just stays in Public Folder Instances indefinitely, unless you add the store back to the replica list and then remove it again, causing a new RDPSR to be sent. SP2 changed this behavior so that the store retries every hour until it gets an RDPA from someone.

Troubleshooting

This is almost the same as troubleshooting the backfill process.

1. Did the pending deletion replica send a 0x20?

Unless you already had logging turned up when you removed the replica, you won't know. Fortunately, you can just add the replica back and then remove it again. Then watch the application log for the 0x20.

2. Did the 0x20 reach the other replicas?

You should know the drill by now. Check the application logs on the other replicas to see if they received the 0x20.

3. Did any other replica respond with a 0x10?

This is the part you'll probably end up focusing on. If a replica received the 0x20 from the pending deletion replica but did not respond with a 0x10, the pending deletion replica has data that the other replica doesn't. Because the other replica just received a 0x20 from the pending deletion replica, it already knows what data it's missing. Therefore, you'd expect to see it issue a backfill request for that folder every 24-48 hours. Watch the application log, and troubleshoot it exactly like the normal backfill process described earlier.

4. Did the pending deletion replica receive the 0x10?

Once any other replica has all the data, that replica should respond with a 0x10. When the pending deletion replica receives that 0x10, it will finally be willing to delete that data. That doesn't mean it will happen immediately, though. If there are clients using that replica, it won't be deleted until later during online maintenance. If you want, you could speed this up by dismounting and mounting the store to disconnect the clients.
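The four questions above form a simple decision chain, which can be written down as a small helper. This is a hypothetical sketch for keeping track of where you are in the process: the function name `next_check` and its boolean parameters are made up for illustration, and you would fill them in by hand from what you see in the application logs.

```python
def next_check(sent_0x20, others_received_0x20, other_sent_0x10, pending_received_0x10):
    """Return the next troubleshooting action, following steps 1-4 in order."""
    if not sent_0x20:
        return ("Step 1: add the replica back, remove it again, "
                "and watch the application log for the 0x20.")
    if not others_received_0x20:
        return ("Step 2: check the application logs on the other "
                "replicas for the incoming 0x20.")
    if not other_sent_0x10:
        return ("Step 3: another replica is missing data; troubleshoot "
                "its backfill requests for this folder.")
    if not pending_received_0x10:
        return ("Step 4: check the application log on the pending "
                "deletion replica for the incoming 0x10.")
    return ("The deletion is approved; it may wait for online maintenance "
            "if clients are still using the replica.")

# Example: the 0x20 went out and arrived, but nobody answered with a 0x10.
print(next_check(True, True, False, False))
```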

Common Problems

You may find that a server sent some type of replication message to another server, but the receiving server never logged the incoming message in the application log. However, message tracking says it was delivered locally to the store on that server. This behavior usually indicates either a problem with the Replication State table or a permissions problem on the SMTP virtual server.

Let's cover the easy one first. One problem that causes an incoming replication message to be ignored by the receiving server is a problem with the Replication State, or ReplState, table. Note that a problem with the ReplState table may also cause the server to fail to issue backfill requests (0x8) for some folders, so this information also applies to that situation.

Each public store uses its ReplState table to track the state of replication for any replicated folders. The table contains multiple rows for each folder - one row per replica. It's possible for the rows in the ReplState table to get out of sync with the replica list, such that it has extra or missing rows. Sometimes you can get it to sync up again just by making a change such as removing a server from the replica list, applying the change, and then immediately adding it back, but this doesn't always work.

Fortunately, a ReplState test was added to isinteg. See KB889331 for Exchange 2003, or KB892485 for Exchange 2000. As long as you have the updated isinteg.exe and store.exe, you can use isinteg to correct the problem with the ReplState table. If you run only the ReplState test, it is typically very fast - less than a minute even on a large public store. Once isinteg has been run, you may still need to go back and make a change to the folder to get the ReplState table to sync up with the replica list. After they're in sync, the server should be able to process the incoming replication messages, or should begin issuing backfill requests normally.
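To make "extra or missing rows" concrete, here is a small Python sketch of the comparison. It is purely illustrative: the real ReplState table lives inside the store and is not exposed like this, and the function `replstate_drift` and both input shapes are assumptions made up for the example.

```python
def replstate_drift(replica_list, replstate_rows):
    """Compare a replica list with ReplState-style rows.

    replica_list:   dict mapping folder name -> set of replica server names
    replstate_rows: iterable of (folder, server) pairs, one row per replica
    Returns a dict of folders whose rows are out of sync with the list.
    """
    rows_by_folder = {}
    for folder, server in replstate_rows:
        rows_by_folder.setdefault(folder, set()).add(server)

    drift = {}
    for folder in set(replica_list) | set(rows_by_folder):
        expected = replica_list.get(folder, set())
        actual = rows_by_folder.get(folder, set())
        missing = expected - actual   # replica listed, but no row for it
        extra = actual - expected     # row exists, but replica is not listed
        if missing or extra:
            drift[folder] = {"missing": missing, "extra": extra}
    return drift

replicas = {"Folder1": {"A", "B"}, "Folder2": {"A"}}
rows = [("Folder1", "A"), ("Folder2", "A"), ("Folder2", "C")]
print(replstate_drift(replicas, rows))
```

In this example Folder1 has no row for server B and Folder2 has a leftover row for server C, which are the two kinds of mismatch described above.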

The other common problem that causes an incoming replication message to be ignored is an issue specific to Exchange 2003. An Exchange 2003 server requires that the sending server has the Send As right on the receiving SMTP virtual server. That is, if ServerA is Exchange 2003, and ServerB is sending a PF replication message to ServerA, ServerB must have Send As on ServerA's SMTP virtual server. Otherwise, ServerA does not process the incoming replication message. This permission is normally granted through the Exchange Domain Servers group. If the Send As right is the problem, all incoming replication messages from a particular server will fail. I find it easiest to identify this problem with a network trace taken while a replication message is being transferred from one server to another. The conversation should go like this:

ServerA: 220 ServerA.microsoft.com Microsoft ESMTP MAIL Service...

ServerB: EHLO ServerB.microsoft.com

ServerA: 250-ServerA.microsoft.com Hello

         250-TURN

         250-SIZE

         250-ETRN

         250-PIPELINING

         250-DSN

         250-ENHANCEDSTATUSCODES

         250-8bitmime

         250-BINARYMIME

         250-CHUNKING

         250-VRFY

         250-X-EXPS GSSAPI NTLM LOGIN

         250-X-EXPS=LOGIN

         250-AUTH GSSAPI NTLM LOGIN

         250-AUTH=LOGIN

         250-X-LINK2STATE

         250-X-EXCH50

         250 OK

The important part here is that ServerA must be advertising the GSSAPI NTLM LOGIN options. If you don't see these in ServerA's response to the EHLO, it's usually because Integrated Windows Authentication has been unchecked on the SMTP virtual server. This is mentioned in step 1 of KB843106 and step 3 of KB842273. As long as these authentication verbs appear, you should see ServerB try to use them:

ServerB: X-EXPS GSSAPI

ServerA: 334 GSSAPI supported

ServerB: <a bunch of base64 encoded data>

ServerA: 334 <more base64 encoded stuff>

ServerB: CRLF

ServerA: 235 2.7.0 Authentication successful.

If authentication does not succeed, you may have a Kerberos problem or an issue with the computer account for ServerB. Next, the servers transmit link state information. After that, they finally get around to the business of transferring email:

ServerB: MAIL FROM:<ServerB-IS@microsoft.com>

ServerA: 250 2.1.0 ServerB-IS@microsoft.com....Sender OK

ServerB: RCPT TO:<ServerA-IS@microsoft.com> NOTIFY=NEVER

ServerA: 250 2.1.5 ServerA-IS@microsoft.com

ServerB: XEXCH50 2404 2

ServerA: 354 Send binary data

It's this last response to the XEXCH50 verb that's important. If the response is "354 Send binary data", then everything is fine, at least as far as permissions to the SMTP virtual server are concerned. If the GSSAPI NTLM login options were not advertised, or the authentication attempt failed, then it's expected that ServerA will instead respond with "504 Need to authenticate". If those steps succeeded, but ServerA still says "504 Need to authenticate" instead of "354 Send binary data", then ServerB does not have the Send As right on ServerA's SMTP virtual server. There are several ways this could happen. For one, when you delegate rights such as Exchange Full Administrator in ESM, that user or group inherits a deny on the Send As right. Therefore, using ESM to delegate admin rights to the computer account, the Exchange Domain Servers group, or some other group that contains the Exchange servers will break public folder replication. Another possibility is that the computer account is not in the Exchange Domain Servers group, which is how it normally has the Send As right. You'll need to evaluate the permissions on the SMTP virtual server and determine why the computer account for the sending server does not have the proper rights. See KB843106 and KB842273 for more details about the "504 Need to authenticate" problem.
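If you have the SMTP conversation from a network trace as text, the reasoning above can be applied mechanically. The following Python sketch is one way to do it, under the assumption that the trace is a list of lines shaped like the excerpts in this post (an optional "ServerA:" style label followed by the SMTP text). The function names and result strings are hypothetical, not part of any Exchange tool.

```python
import re

def _payload(line):
    """Drop an optional 'ServerA:' style speaker label and extra whitespace."""
    m = re.match(r"^\s*\w+:\s+(.*)$", line)
    return (m.group(1) if m else line).strip()

def diagnose_trace(lines):
    msgs = [p for p in (_payload(l) for l in lines) if p]

    # Did the receiving server advertise the X-EXPS authentication options?
    advertised = any(m.startswith("250") and "X-EXPS" in m and "GSSAPI" in m
                     for m in msgs)
    # Did the authentication exchange end with a 235 success reply?
    authenticated = any(m.startswith("235") for m in msgs)

    # Find the reply that immediately follows the XEXCH50 command.
    response = None
    for i, m in enumerate(msgs[:-1]):
        if m.startswith("XEXCH50"):
            response = msgs[i + 1]
            break

    if response is None:
        return "inconclusive: no XEXCH50 exchange found"
    if response.startswith("354"):
        return "ok: SMTP virtual server permissions look fine"
    if response.startswith("504"):
        if not advertised:
            return "auth not advertised: check Integrated Windows Authentication"
        if not authenticated:
            return "auth failed: check Kerberos or the sender's computer account"
        return "send-as missing: sender lacks Send As on the SMTP virtual server"
    return "inconclusive: unexpected reply to XEXCH50"

trace = [
    "ServerB: EHLO ServerB.microsoft.com",
    "ServerA: 250-ServerA.microsoft.com Hello",
    "         250-X-EXPS GSSAPI NTLM LOGIN",
    "         250 OK",
    "ServerB: X-EXPS GSSAPI",
    "ServerA: 235 2.7.0 Authentication successful.",
    "ServerB: XEXCH50 2404 2",
    "ServerA: 504 Need to authenticate",
]
print(diagnose_trace(trace))
```

The order of the checks mirrors the text: a 504 is only blamed on the Send As right after the advertisement and the authentication have both been ruled out.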

Conclusion

You may have noticed as you read through this document that SP2 for Exchange 2003 contains several important enhancements under the hood to prevent replication issues and assist in troubleshooting them. Environments with multiple public stores can see a huge benefit from SP2, especially when it comes to moving replicas between servers and adding and removing public stores.

I hope this was helpful. Thanks to Dave Whitney for reviewing all this!

- Bill Long