CSI: Replication–the mysterious vanishing row

I have been working with a customer on an interesting situation. On occasion they have replication issues. Examining the error we see many errors that state:

The row was not found at the Subscriber when applying the replicated command

So we need to understand why the row was not found. (cue TV show music)…. CSI: Replication

We fire up SQL Server Management Studio and connect to the distributor. We then issue the following T-SQL code

 use distribution;
go
select * from MSRepl_Errors

This returns us the transaction sequence number of the transaction in our publisher which cannot be replicated. We want to see what table is being involved so we are going to have to use an undocumented function to examine the transaction log in order to uncover that clue…

There are many great posts on the web about ::fn_dblog which explain this undocumented function. Most notably Paul Randal’s blog post here.

In Paul’s post he provides code to go from the result set back to a hexadecimal representation of the xact_seqno. However, we want to search the output of the call to fn_dblog and in order to do so, we have to convert the format returned by MSRepl_Errors (0x0000004E00000048000400000000) into the colon separated hexadecimal format that fn_dblog returns (0000004e:00000048:0004).

To do this we can use the following code

 use distribution
go

declare @LSN varbinary(16)
select TOP 1 @LSN = xact_seqno 
    from MSrepl_errors 
    group by xact_seqno, time 
    order by time desc
select @LSN

-- Now we have to convert it to the Current LSN format that fn_dblog uses
declare @convertedLSN varchar(max)
select @convertedLSN = [sys].[fn_varbintohexstr](@LSN)

SELECT @convertedLSN =
       SUBSTRING(@convertedLSN, 3,8 ) +':'+
       RIGHT( '0000000000' + SUBSTRING( CAST(@convertedLSN as varchar), 11, 8 ), 8 ) + ':' +
       RIGHT( '00000'      + SUBSTRING( CAST(@convertedLSN as varchar), 19, 4), 4 ) 
 --fn_dblog uses the current context to decide which transaction log to read.
  use AdventureWorks2008R2;

select * from
fn_dblog (NULL,NULL) Where [Transaction ID] in
    (select [Transaction ID] from fn_dblog(NULL,NULL)
        where [Current LSN] = @convertedLSN) 

In my case we can see that a row in the Sales.SalesTaxRate table has been modified at the publisher, but cannot be replicated to the subscriber.

 .
 Current LSN                    Operation        Transaction ID   AllocUnitName
0000004e:00000048:0001    LOP_BEGIN_XACT        0000:000017ec    NULL
0000004e:00000048:0003    LOP_MODIFY_ROW        0000:000017ec    Sales.SalesTaxRate.PK_SalesTaxRate_SalesTaxRateID
0000004e:00000048:0004    LOP_COMMIT_XACT       0000:000017ec    NULL

Now we must locate witnesses on the subscriber who might have seen what happened to our row.

So, we get our forensics kit and head out to the subscriber

We need to use our new friend fn_dblog to examine the transaction log of the subscriber using the clues we obtained above.

 declare @tablename nvarchar(256) = '%Sales.SalesTaxRate%' -- NOTE: the % because the AllocUnitName will also contain the object name (i.e. Index name)

-- fn_dblog uses the current context to pick the log to view
-- So, set your database name here
USE AdventureWorks2008R2;

-- Now lets see the log records for deleted rows for this table
select  
    SUSER_SNAME([Transaction SID]) as [User]
    ,SPID
    ,ISNULL([AllocUnitName],'Delete Transaction') + 
        case when Operation='LOP_BEGIN_XACT' then ' Start' 
             when Operation='LOP_ABORT_XACT' then ' Aborted' 
             when Operation='LOP_DELETE_ROWS' then ' Deleting'
             else ' Committed' end as [Description]
    ,[Begin Time]
    ,[Current LSN]
    ,[Operation]
    ,[Context]
    ,[Transaction ID]
    ,[Description]
    ,[Transaction ID]    
    ,*                                                -- Leave this here in case we want other columns
from ::fn_dblog(NULL, NULL) 
    where [Transaction ID] IN
    (select [Transaction ID] from ::fn_dblog(NULL,NULL)
        where [Operation]='LOP_DELETE_ROWS' and [AllocUnitName] LIKE @tablename )
GO

We are getting closer…

After interrogating our suspects we obtain the following evidence (output abbreviated for clarity)

User                          SPID    Description                                                                                                           Begin Time
REDMOND\bobtay    51        Delete Transaction Start                                                                                      2013/09/12 09:54:10:563
NULL                         NULL   Sales.SalesTaxRate.AK_SalesTaxRate_StateProvinceID_TaxType Deleting    NULL
NULL                         NULL   Sales.SalesTaxRate.AK_SalesTaxRate_rowguid Deleting                                 NULL
NULL                         NULL   Sales.SalesTaxRate.PK_SalesTaxRate_SalesTaxRateID Deleting                    NULL
NULL                         NULL   Delete Transaction Committed                                                                             NULL

It appears that boB Taylor is our suspect and deleted the row that we needed to replicate!

But wait, our suspect is really quite clever. After he deleted the row, the issued a checkpoint command and backed up the transaction log.

In this case, the output of the fn_dblog query above will return zero rows.

Time for advanced forensic tools – fn_dump_dblog. Paul Randal describes all the parameters in hisblog post here so I won’t duplicate that but essentially it performs a similar analysis for the transaction log that has been backed up.

 DECLARE @tranlog nvarchar(1024) = N'C:\Program Files\Microsoft SQL Server\MSSQL11.SQL2K12SP1\MSSQL\Backup\MyTestDB.trn'

-- Also, if you are appending to your backup set with each tran
-- log backup you will need to view the file with HEADER_ONLY to determine which file in the sequence to use.
-- you can examine the time frames by using the restore UI to see the files in the backup set and the time frames that 
-- they cover.
DECLARE @FileNumber int = 1

select
    SUSER_SNAME([Transaction SID]) as [User]
    ,SPID
    ,ISNULL([AllocUnitName],'Delete Transaction') + 
        case when Operation='LOP_BEGIN_XACT' then ' Start' 
             when Operation='LOP_ABORT_XACT' then ' Aborted' 
             when Operation='LOP_DELETE_ROWS' then ' Deleting'
             else ' Committed' end as [Description]
    ,[Begin Time]
    ,[Current LSN]
    ,[Operation]
    ,[Context]
    ,[Transaction ID]
    ,[Description]
    ,*
from
    fn_dump_dblog (
        NULL, NULL, N'DISK', @FileNumber, @tranlog,
        DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT,
        DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT,
        DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT,
        DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT,
        DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT,
        DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT,
        DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT,
        DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT,
        DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT),
    (select
        distinct [Transaction ID] AS [tid]
    from
        fn_dump_dblog (
            NULL, NULL, N'DISK', @FileNumber, @tranlog,
            DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT,
            DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT,
            DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT,
            DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT,
            DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT,
            DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT,
            DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT,
            DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT,
            DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT)
    where
        [Operation] = 'LOP_DELETE_ROWS') [fd]
where [Transaction ID] = [fd].[tid];
GO

We get the same information from the physical log

User                           SPID    Description                                   Begin Time                         Current LSN                          AllocUnitID
REDMOND\bobtay    59          Delete Transaction Start             2013/09/16 15:54:06:910   00000026:00000298:0001
NULL                         NULL     Deleting                                       NULL                                  00000026:00000298:0003   72057594044153856
NULL                         NULL     Delete Transaction Committed    NULL                                  00000026:00000298:0005

Now in this example we know based on the time frame that this is our record, but if there are a larger number of deletes occurring we need to figure out if this is our table. To do this we must then take the AllocUnitID (which is conveniently translated for us in fn_dblog) and get the table name. The following code will do that for you.

 -- set this from AllocUnitID in the output of fn_dump_dblog
declare @allocUnitID bigint = 72057594044153856 
declare @allocationType tinyint 
declare @containerID bigint

select @containerID = Container_Id, @allocationType = type
    from sys.allocation_units 
where allocation_unit_id = @allocUnitID

if @allocationType IN (1,3)
begin
    select object_name(object_id) as [Object Name]
        from sys.partitions 
    where hobt_id = @containerID
end
else 
begin
    if @allocationType= 2 
    begin
        select object_name(object_id) as [Object Name]
            from sys.partitions 
        where partition_id = @containerID
    end
    else
        select 'Objected is marked for deferred drop.'
end

Just as we expected the output gives us

Object Name
SalesTaxRate

This case is closed!