Replication Enhancement – Improved Distribution Database Cleanup

Replication Distribution database is the perhaps the most important entity in the entire replication topology. Apart from storing the replication metadata and history for replication agents, the distribution database, in case of transactional replication, also provides intermediate storage for the transactions and commands that will be shipped to the subscribers. These transactions and commands stored in the MSRepl_Transactions and MSRepl_Commands tables. In a transactional replication topology which involves a highly active Publisher (and Publication database), i.e. a publisher with high numbers of transactions/second, the MSRepl_Commands and MSRepl_Transactions tables on the distribution database can grow substantially large.

The Distribution Cleanup job, which runs every 10 minutes on the Distributor Server is responsible for cleaning up the transactions and commands based on the specified transaction retention period, which by default is 72 hours. Additionally, the distribution cleanup job also does the cleanup of the expired subscriptions on the distributor, which is an expensive operation, especially given that its running every 10 minutes.

NOTE: If the same publication database has multiple different transaction or snapshot publications, and the distribution agents for these publications are configured to run at different intervals (for example one is configured to weekly, the second is configured to run daily, while the third is running continuously) its possible that the cleanup will not delete data which is older than 72 hours. This is because of the cleanup job calculates the @max_cleanup_xact_seqno based on the Publisher database and not on the individual distribution agents for the publications on this publisher database.

In case of a highly active environment the cleanup job would need to delete a substantially large set of expired transactions and commands. The cleanup job uses a while loop to delete entries from the MSRepl_Commands and MSRepl_Transactions in batches of 2000 and 5000 rows, respectively. Depending on how large the MSRepl_Commands and MSRepl_Transactions Tables are, the cleanup activity can be cumbersome and time consuming, leading to several performance issues like lock blocking and sometimes replication agent failures. These hard-coded batch sizes work well when the size of the replication tables is small, but do not perform well when these replication tables become large, let's say like 200-300 million rows.

The new implementation of the procedure allows users to parameterize the batch sizes inside the cleanup stored procedure and introduces a new adaptive approach to determine the batch sizes for each iteration. The procedure will, by default, honor any batch sizes that are mentioned explicitly as parameters. If no parameters are supplied, then the procedure will start with the default batch sizes of 2000 and 5000 rows and increase/decrease the batch sizes based on the performance of the previous iteration of the delete operation. If the time taken by the delete query improves by 50% compared to previous execution, the batch size value is increased by 20% up to a max value of 50000 rows per batch, and if the performance of the delete operation decreases by 20% as compared to the previous iteration, the batch size is decreased by 50% up to a default value of 2000/5000 rows per batch for MSReplCommands and MSRepl_Transactions, respectively.

The new implementation also separates the subscription cleanup task from the main distribution cleanup work, by moving the subscription cleanup to the Expired subscription cleanup job that runs at the publisher. The Expired Subscription cleanup job which runs once a day, makes a connection to the distribution database server to execute the cleanup subscription stored procedure.

Finally, In the older implementation the cleanup job reports the row counts based on the last iteration of the while loop and not the cumulative values for the rows deleted. This has been changed to accumulate and report out the cumulative numbers. Also, the rate of cleanup that is displayed in the message has been changed from rows/sec to rows/millisecond as this provides more accurate rate. For e.g., if the elapsed time is less than a second, the existing procedure reports as 0 records deleted in 0 seconds. This has been changed so that we would see something like 'n' records deleted in 'm' milliseconds.

To support the improvement, two new parameters @deletebatchsize_xact and @deletebatchsize_cmd were added to sp_adddistributiondb stored procedure mentioned here.. These parameters control the delete batch sizes for the MSRepl_Transactions and MSRepl_Commands tables respectively. Additionally, SSMS was updated to support the changes.

This improvement was first released in SQL Server 2017,  followed by releases in SQL Server 2012 SP4 and SQL Server 2016 SP2. This improvement would be available in an upcoming servicing release for SQL Server 2014.

Sourabh Agarwal (@SQLSourabh)
Senior PM, SQL Server Tiger Team
Follow us on Twitter: @mssqltiger | Team Blog: Aka.ms/sqlserverteam