Sometimes Replication Monitor shows number of Undistributed Transactions as a very high number, but replication itself works properly.
An overview of the inner working of a replication monitor
We might see large number of Pending transactions on one of the Publisher in P2P replication. But when we check the Replication Monitor for second server (Subscriber) it might show a small value (which would be the actual value). You might see that the latency on both sides are which is very less.
All of these point that there might a wrong notification from the Replication monitor.
First we can try the manual way to see how many transactions are actually undistributed.
select * from msreplication_subscription (on the subscriber database)
· We would get the value under transaction_timestamp column.
· Remove the trailing 0's and try the following
select count(*) from msrepl_transactions where xact_seqno > <seq_no> (on distribution database)
· It would return the number of rows which we see in Replication monitor on Subscriber
Then we can try to use TracerToken to check the latency
STEPS
====
sp_posttracertoken @publication='<Publication Name>'
sp_helptracertokens @publication='<Publication Name>' ---> This would give the Tracer_ID
sp_helptracertokenhistory @publication='<Publication Name>',@tracer_id='<number>'
sp_deletetracertokenhistory @publication='<Publication Name>',@tracer_id='<number>' --> This is to clear the traced token
=> We can try this few times and confirm a low latency. A low latency would mean that replication is working fine.
Above steps would help identify that the replication itself is working fine but there is some problem in replication monitor when we see values
-----------------
The Replication Monitor runs the following stored procedure to show number of undistributed transactions.
sys.sp_replmonitorsubscriptionpendingcmds
If you look at the text of the stored procedure. We would see that it is basically only looking at the database MSDistribution_history
Here I have prepared a reduced form of this Stored procedure which would do the same thing that the replication monitor does (in terms of showing Undistributed transaction)
<Reduced Form of sys.sp_replmonitorsubscriptionpendingcmds>
declare @avg_rate int
declare @retcode int
DECLARE @mstestcountab TABLE ( pendingcmdcount int )
select @avg_rate = isnull(avg(<delivery_rate>),0.0)
from dbo.MSdistribution_history
where agent_id = 3
insert into @mstestcountab (pendingcmdcount)
exec @retcode = sys.sp_MSget_repl_commands
@agent_id = 3
,@last_xact_seqno = <xact_seqno>
,@get_count = 2
,@compatibility_level = 9000000
select pendingcmdcount from @mstestcountab
select
pendingcmdcount
,N'estimatedprocesstime' = case when (@avg_rate != 0.0)
then cast((cast(pendingcmdcount as float) / @avg_rate) as int)
else pendingcmdcount end
from @mstestcountab
<Reduced Form of sys.sp_replmonitorsubscriptionpendingcmds>
=> The Delivery Rate and the XACT_Seqno in the above script is got from the MSDistribution_History table.
=> You can used the following to get that information
select xact_seqno, delivery_rate from dbo.MSdistribution_history
where agent_id = <give the agent ID number) and runstatus in (2,3,4)
=> When you feed that result to the script, you would see that the number returned is same that of the Replication Monitor.
=> So now we would know that the MSdistribution_history is not updated properly and the Xact_Seqno is a very old transaction and it should be updated with the latest replicated transaction.
=> The fact is that we have something called sp_MSadd_distribution_history which is also executed by sp_MSagent_stethoscope which is in turn executed by sp_replication_agent_checkup.
=> So we should have a job called Replication Agents Checkup.' which runs every 10 secs and this should be doing the job of updating the proper xact_seqno
=> In most cases where we see this issue, that Job is not running or is running but with the following message.
<Message>
The replication agent has not logged a progress message in 10 minutes. This might indicate an unresponsive agent or high system activity. Verify that records are being replicated to the destination and that connections to the Subscriber, Publisher, and Distributor are still active. [SQLSTATE 01000] (Message 20554). The step succeeded.
<\Message>
Steps we could take from here to get correct values into the MSDistribution_History
a) Restart the Agents to see if that helps
b) Collect a detailed profiler trace when the "Replication Agents Checkup" job is running and we have to see if the SP_MSAdd_Distribution_history is run or not.
c) We can try and delete the rows manually. Then the appropriate job should insert and then we see if its updated subsequently.
<STEPS>
1) Take a backup of the Distribution Database.
2) Create a test table (temp_MSDistribution_history) with same schema as the original
MSDistribution_history table.
3) Select all rows from the MSDistribution_History table and insert into the newly created
table.
4) Manually delete the rows in the Distribution history for the specific agent ID that we figured
out the last time.
5) Run the Replication Agents Checkup and see if a new row for that agent ID is inserted
with the upto date xact_seqno.
Deleting rows from the MSDistribution history would not cause any problem on the replication.
But to ensure proper safety, I suggest to backup the Distribution and if deleting the rows
does not help, then we can always re-insert the deleted rows from the
temp_MSDistribution_history to MSDistribution_history.
<\STEPS>
Note: In most cases, a restart of the agents resolves the issue.
Regards,
Gourav Das
Support Engineer, Microsoft SQL Server.
Reviewed by,
Ouseph Devis T & Akbar Farishta
TL, Microsoft SQL Server support
Gaurav Mathur
SEE, Microsoft SQL Server support
Comments
- Anonymous
March 25, 2015
Really good article very helpful but the code shown is full of errors.