Public Folder Replication Troubleshooting – Part 1: Troubleshooting the Replication of New Changes

Original post on US blog: January 17, 2006

Note: this blog post is the first in a Public Folder Troubleshooting series. Please also make sure to check out part 2, part 3 and part 4 of the series.

Introduction

The purpose of this blog post series is to serve as a guide to troubleshooting public folder replication problems. It will not tell you exactly how to fix every possible replication problem. However, it will show you how to isolate every possible replication problem so that you focus your troubleshooting on the point of failure. Put another way, this post is intended to take you from a problem description like “The content on my old server isn’t replicating to my new server” to a much narrower problem description like “My old server isn’t responding to the status requests from my new server, therefore the new server doesn’t know it’s missing data and isn’t trying to backfill. This means the problem is actually with the old server.” This post will also describe how to identify a few of the most common replication problems. Before I get into the details of troubleshooting, I want to give an overview of my general approach to these issues.

The best troubleshooting tool for public folder replication is the application log. In order to isolate a replication problem, you must be able to follow the replication events in the log to see exactly where the process is breaking down. Typically, you should turn up Diagnostics Logging on Replication Incoming and Replication Outgoing to maximum as a starting point for troubleshooting. Each time a store sends or receives a replication message, it will log an event to that effect (assuming logging is turned up). The various kinds of replication messages can be differentiated by the 'Type' shown in the description of the event. I prefer to focus on the Type rather than the event ID for several reasons:

- Event IDs change between versions of Exchange. For instance, in Exchange 5.5 an outbound backfill request was a 3014. In Exchange 2000 and 2003 it's a 3016.

- Incoming and outgoing event IDs are different for each type. An outgoing hierarchy message is a 3018, while an incoming hierarchy message is 3028.

- Status requests and status messages use the same event IDs, even though they are two different types of messages. Thus, you can't distinguish between them from event ID alone.

Focusing on the Type is a little easier. You can easily correlate these with Event IDs by examining your application log. There are only 7 types, and you can see if the message is outgoing or incoming by looking at the Category of the event. If you focus on types instead of event IDs, all you need to remember is:

Hierarchy - 0x2

Content - 0x4

Backfill Request - 0x8

Backfill Response - 0x80000002 (for hierarchy) or 0x80000004 (for content)

Status - 0x10

Status Request - 0x20
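
As a quick reference, the type values above can be kept in a small lookup table. This is only a sketch in Python; the names REPLICATION_TYPES and describe_type are made up for illustration and are not part of any Exchange tool.

```python
# Replication message types as listed above, keyed by the value shown
# in the "Type:" field of the event description.
REPLICATION_TYPES = {
    0x2: "Hierarchy",
    0x4: "Content",
    0x8: "Backfill Request",
    0x10: "Status",
    0x20: "Status Request",
    0x80000002: "Backfill Response (hierarchy)",
    0x80000004: "Backfill Response (content)",
}


def describe_type(type_field):
    """Translate a Type string such as '0x2' into a readable name.

    describe_type is a hypothetical helper, not an Exchange API.
    """
    code = int(type_field, 16)  # base 16 accepts the '0x' prefix
    return REPLICATION_TYPES.get(code, "Unknown type " + type_field)


if __name__ == "__main__":
    for field in ("0x2", "0x20", "0x80000004"):
        print(field, "=", describe_type(field))
```

Whether the message was outgoing or incoming still comes from the Category of the event, not from the type value.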

Also note that Replication Errors logging is rarely helpful. Even when replication is working normally, most servers will generate lots of replication error events, such as event ID 3093 indicating there was an error reading a property. In most cases the property has no effect on replication and the error can be ignored. I recommend leaving Replication Errors logging at None unless you're looking for something specific, such as the problem described in this previous blog post.

Troubleshooting the Replication of New Changes

To troubleshoot public folder replication, you must first be familiar with the normal message flow that is expected when replication is working. Based on that knowledge, you can identify the point of failure when a problem arises. Before I discuss how to troubleshoot the replication of new changes, let's describe what the expected behavior is.

Hierarchy Replication

The replication of new hierarchy changes takes place whenever a folder is created or deleted, or a change is made to the properties of a public folder, such as the replica list, client permissions, description, administrative note, or storage limits. Note that this does not include the email addresses on a mail-enabled folder. The email addresses are stored on the directory object in Active Directory, so changing them does not result in hierarchy replication. Only changing the properties stored in the public store itself will cause hierarchy replication.

Every 15 minutes (by default), the store broadcasts any changes that were made to the folder properties in one or more hierarchy replication messages. With logging turned up to Maximum on Replication Outgoing, you'll see an event like this on the server where the hierarchy was modified:

Event Type: Information

Event Source:     MSExchangeIS Public Store

Event Category:   Replication Outgoing Messages

Event ID:   3018

Description:

An outgoing replication message was issued.

Type: 0x2

Message ID: <91ACCD0758385549A56A10971798985572D5@bilongexch1.bilong.test>

Database "First Storage Group\Public Folder Store (BILONGEXCH1)"

CN min: 1-72CF, CN max: 1-72D3

RFIs: 1

1) FID: 1-6994, PFID: 1-1, Offset: 28

      IPM_SUBTREE\NewFolder

Notice the "Type: 0x2" at the beginning of the description, identifying this as a hierarchy replication message.
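
If you have exported event descriptions as text and want to pull out the fields that matter for later steps (the Type and the Message ID), something like the following Python sketch could work. It assumes the description text looks like the sample event above; parse_replication_event is a hypothetical helper, and the regular expressions are based only on that one sample.

```python
import re

# Description text copied from the sample 3018 event above.
SAMPLE = r"""An outgoing replication message was issued.
Type: 0x2
Message ID: <91ACCD0758385549A56A10971798985572D5@bilongexch1.bilong.test>
Database "First Storage Group\Public Folder Store (BILONGEXCH1)"
CN min: 1-72CF, CN max: 1-72D3
RFIs: 1
1) FID: 1-6994, PFID: 1-1, Offset: 28
      IPM_SUBTREE\NewFolder"""


def parse_replication_event(description):
    """Extract the Type, Message ID and CN range from an event description.

    Any field that is not found is returned as None.
    """
    type_match = re.search(r"^\s*Type:\s*(0x[0-9A-Fa-f]+)", description, re.MULTILINE)
    id_match = re.search(r"^\s*Message ID:\s*(<[^>]+>)", description, re.MULTILINE)
    cn_match = re.search(r"CN min:\s*([^,\s]+),\s*CN max:\s*([^,\s]+)", description)
    return {
        "type": int(type_match.group(1), 16) if type_match else None,
        "message_id": id_match.group(1) if id_match else None,
        "cn_range": cn_match.groups() if cn_match else None,
    }


if __name__ == "__main__":
    print(parse_replication_event(SAMPLE))
```

The Message ID is the value you will want later when tracking the message.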

A hierarchy replication message is sent from the originating server directly to all other public stores. There's no concept of topology for the replication of new changes. All changes to the hierarchy are sent directly from the server on which the changes were made to all other servers that have a public store associated with the same hierarchy. Every other server should log an incoming event showing a type 0x2 when it successfully processes the incoming replication message.

Content Replication

Content replication takes place whenever a message is created or deleted, or the properties of a message are changed. The times at which the store broadcasts content changes for a folder can be modified by changing the replication schedule on that folder, but by default the changes will be broadcast every 15 minutes, just like the hierarchy. A content replication message is identified by the type 0x4 in the event description. Once again, there is no concept of topology for the broadcast of new changes. When the content of a folder is modified on any given replica, that replica will send a 0x4 message directly to all other replicas of the folder. And again, every receiving server should log an incoming 0x4 event when it successfully processes the incoming message.
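
Because every receiving server is expected to log an incoming event, finding the servers that did not is a simple set difference. Here is a rough Python sketch. The record layout (a dict with "server" and "type" keys), the server names, and the function name are all assumptions made for illustration; how you collect the events from each server's application log is up to you.

```python
def servers_missing_incoming(receivers, originator, incoming_events, msg_type):
    """Return the servers that should have logged an incoming event but did not.

    receivers:       every store that should get the broadcast (all public
                     stores for hierarchy, all replicas of the folder for content)
    originator:      the server where the change was made
    incoming_events: hypothetical records like {"server": "EXCH2", "type": 0x2}
    msg_type:        0x2 for hierarchy, 0x4 for content
    """
    expected = set(receivers) - {originator}
    acknowledged = {e["server"] for e in incoming_events if e["type"] == msg_type}
    return sorted(expected - acknowledged)


if __name__ == "__main__":
    # Hypothetical three-server organization; only EXCH2 logged an incoming 0x2.
    events = [{"server": "EXCH2", "type": 0x2}]
    print(servers_missing_incoming(["EXCH1", "EXCH2", "EXCH3"], "EXCH1", events, 0x2))
```

Any server this returns is one where the incoming side needs a closer look.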

Troubleshooting Steps

Those are the two most basic scenarios for replication. When new hierarchy changes or new content changes are not making it from one server to another, troubleshooting is very straightforward.

1. Did the server generate an outbound replication message?

So you made a change to a folder, or you added content to a folder on a particular server, and that change didn't make it to some other server. The first question to answer is: did the server where you made the change broadcast it? When troubleshooting, it's important to keep track of which server you're working against when you make changes. There are several ways to accomplish this.

In ESM, you can right-click on the Public Folders node and choose "Connect To" to point to a particular store. For the most part ESM will make any changes on the specified store, but be aware of one exception: Client Permissions. Prior to Exchange 2003 SP2, when you changed Client Permissions through ESM, ESM would attempt to make the change against a store that houses a replica of the folder, even though this isn't necessary since the permissions are stored as a property of the folder in the hierarchy. With the 2003 SP2 version of ESM, this was changed so that it now makes the change on the store you've pointed it to. When you're testing hierarchy replication by making changes through ESM, you should avoid using permissions for testing unless you're running ESM from 2003 SP2, since it may be hard to predict which store the change will be made against.

If you're using Outlook and you want to verify which replica of a folder you are hitting, you can use MFCMAPI or a similar tool to view the PR_REPLICA_SERVER property of the folder. This will show you the name of the server that Outlook is using to access the content of that folder.

If the replication schedule is Always for the folder in question (which is always true for the hierarchy), and you don't see an outbound 0x2 or 0x4 within 15 minutes, then you know something is wrong on the originating server. If the server is not generating any outbound hierarchy or content broadcasts, the replication agent may have failed to start. One of the common scenarios is described in KB272999. The important thing to note here is the 3079 event:

Event ID: 3079

Source: MSExchangeIS Public

Type: Error

Category: Replication Errors

Description:

Unexpected replication thread error 0x3f0.

EcGetReplMsg

EcReplStartup

FReplAgent

This event is logged when you mount the public store, even with no additional logging turned up. If the 3079 includes the "EcReplStartup" function, the replication agent failed to start, and no new changes will be broadcast until the problem is corrected and the store is mounted again.
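
The check itself is simple enough to express in a couple of lines. This Python sketch assumes you already have the event ID and description text in hand; the function name is hypothetical.

```python
# Description text copied from the sample 3079 event above.
SAMPLE_3079 = """Unexpected replication thread error 0x3f0.
EcGetReplMsg
EcReplStartup
FReplAgent"""


def replication_agent_failed_to_start(event_id, description):
    """True when a 3079 event lists EcReplStartup among its functions."""
    return event_id == 3079 and "EcReplStartup" in description


if __name__ == "__main__":
    print(replication_agent_failed_to_start(3079, SAMPLE_3079))
```

A 3079 without EcReplStartup in the function list is a different problem and is not covered by this check.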

Hierarchy replication is also vulnerable to certain permissions problems when Exchange 5.5 public stores are present in the organization. When an Exchange 2000 or 2003 server sends a hierarchy replication message to an Exchange 5.5 server, it must produce a ptagACLData property (the 5.5-style permissions based on legacyExchangeDN) from the ptagNTSD property (the 2000-style permissions based on SID). This means each SID must be converted to a legacyExchangeDN. This SID to legacyExchangeDN conversion can fail for several reasons. For instance, if a SID resolves to more than one user, an event like this may be generated:

Event ID: 9528

Category: General

Source: MSExchangeIS

Type: Error

Description:

The SID S-1-5-21-408248388-469072634-37170099-1391 was found on 2 users in the DS, so the store cannot map this SID to a unique user.

The users involved are:

/DC=com/DC=company/CN=Users/CN=User1

/DC=com/DC=company/CN=Users/CN=User2

Since the SID can't be converted to a legacyExchangeDN, the store will fail to generate an outbound hierarchy broadcast message.
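
The condition behind the 9528 is just "one SID, more than one user." If you have a list of users and their SIDs exported from the directory, a Python sketch like this would find the duplicates. How you export that list is outside the scope of this example, and the function name and input layout are assumptions.

```python
from collections import defaultdict


def find_duplicate_sids(entries):
    """Group (distinguished_name, sid) pairs by SID and keep the shared ones.

    entries is a hypothetical export of user DNs and their SIDs.
    """
    by_sid = defaultdict(list)
    for dn, sid in entries:
        by_sid[sid].append(dn)
    return {sid: dns for sid, dns in by_sid.items() if len(dns) > 1}


if __name__ == "__main__":
    # User1 and User2 come from the sample 9528 event above;
    # User3 and its SID are made up for contrast.
    sample = [
        ("/DC=com/DC=company/CN=Users/CN=User1", "S-1-5-21-408248388-469072634-37170099-1391"),
        ("/DC=com/DC=company/CN=Users/CN=User2", "S-1-5-21-408248388-469072634-37170099-1391"),
        ("/DC=com/DC=company/CN=Users/CN=User3", "S-1-5-21-408248388-469072634-37170099-1400"),
    ]
    for sid, dns in find_duplicate_sids(sample).items():
        print(sid, "is on", len(dns), "users:", dns)
```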

2. Was the message addressed to the server that didn't receive the change?

If the originating server generated the outbound message, the next step is to be sure it was addressed to the server that didn't receive the data. The easiest way to verify this is to track the message. In message tracking, you can just track the message ID that was reported in the outbound replication event. In the Message History window, the To: line will be visible. If the store that didn't receive the changes is not listed here, then again the focus should be on the originating server. Does it see that server in the organization? Does that server have email addresses on its public store object? Does the originating server see that store in the replica list on the folder in question?

3. Did the message make it to the destination server?

Once you've verified that the message was addressed to the destination server, the next question to answer is - did the message make it that far? You can determine this through message tracking. If message tracking says the message was delivered to the store, but you don't see an incoming replication event acknowledging the message, see the "Common Problems" section in the last post of this series.
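
The three questions above form a simple decision sequence, which can be sketched in Python as follows. The Observation fields, the function name, and the server names are made up for illustration; you fill in the answers yourself from the application logs and message tracking. The "message routing" result for an undelivered message is my own shorthand rather than something the steps above spell out.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """What you found in the logs and in message tracking (hypothetical record)."""
    outbound_event_logged: bool    # step 1: outbound 0x2/0x4 on the originator
    tracked_recipients: frozenset  # step 2: servers on the To: line
    delivered_to_store: bool       # step 3: tracking shows delivery to the store
    incoming_event_logged: bool    # destination logged the incoming event


def next_focus(obs, destination):
    """Return where to look next, following the three steps in order."""
    if not obs.outbound_event_logged:
        return "originating server: no outbound replication message was generated"
    if destination not in obs.tracked_recipients:
        return "originating server: message was not addressed to the destination"
    if not obs.delivered_to_store:
        # Assumption: if tracking shows no delivery, look between the servers.
        return "message routing: message did not reach the destination store"
    if not obs.incoming_event_logged:
        return "destination store: delivered, but no incoming replication event"
    return "no failure found for this change"


if __name__ == "__main__":
    obs = Observation(True, frozenset({"EXCH2"}), True, False)
    print(next_focus(obs, "EXCH3"))
```

The point of the ordering is that each answer rules out everything before it, so you only ever investigate one server or one hop at a time.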

In the next blog post: Troubleshooting the Replication of Existing Data.

- Bill Long