Partilhar via


SQLIOSim Checksum Validations

I had a very specific question asked of me related to the SQLIOSIM.exe, checksum validation logic.  It is pretty simple logic (on purpose) but effective so here are the basics.

The key is that there are multiple memory locations used to hold the data and do the comparison.

image

1. Allocate a buffer in memory of 8K.
Stamp the page with random data using Crypto Random function(s)
Save Page Id, File Id, Random seed and calculated checksum values in the header of the page

2. Send the page to stable media (async I/O)
Check for proper write completion success

3. Sometime after successful write(s). Allocate another buffer and read the data from stable media.
(Note: This is a separate buffer from that of the write call)

4. Validate the bytes read.

Do header checks for file id, page id, seed and checksum values

Expected CheckSum: 0xEFC6D39C          ---------- Checksum stored in the WRITE buffer

Received CheckSum: 0xEFC6D39C          ---------- Checksum stored in the READ buffer (what stable media is returning)

Calculated CheckSum: 0xFBD2A468        --------- Checksum as calculated on the READ buffer

The detailed (.TXT) output file(s) show the WRITE image, the READ image and the DIFFERENCES found between them  (think memcmp).   When only a single buffer image is added to the detailed TXT file this indicates that the received header data was damaged or the WRITE buffer is no longer present in memory so only the on disk checksum vs calculated checksum are being compared.
If there appears to be damage SQLIOSim will attempt to read the same data 15 more times and validate before triggering the error condition.   Studies from SQL Server and Exchange showed success of read-retries in some situations.  SQL Server and Exchange will perform up to 4 read-retries in the same situation.
       The window for damage possibilities is from the time the checksum is calculated to the time the read is validated. While this could be SQLIOSim the historical evidence shows this is a low probability. The majority of the time is in kernel and I/O path components and the majority of bugs over the last 8 years have been non-SQL related.         
For vendor debugging the detailed TXT file contains the various page images as well as the sequence of Win32 API calls, thread ids and other information. Using techniques such as a bus analyzer or detailed I/O tracing the vendor can assist at pin-pointing the component causing the damage.

     

The top of the 8K page header is currently the following (Note: This may change with future versions of SQLIOSim.exe)

DWORD Page Number (Take * 8192 for file offset)

DWORD File Id

DWORD Seed value

DWORD Checksum CRC value

BYTE Data[8192 – size(HEADER)] <--------- Checksum protects this data

Bob Dorr - Principal SQL Server Escalation Engineer

Comments

  • Anonymous
    March 02, 2015
    I am seeing only one buffer printed in dump text file after i received error: semaphore timeout period has expired in read io, same error got in write io with same process id. As per mentioned above it may be occurred due to write buffer lost in memory or page header is damaged. In dump file i am seeing only read buffer received there is no write buffer printed also no expected chksum mentioned in dump file which are there in actual data mismatch scenario. Please help to debug semaphore timeout error. What are the reason of this error? I am asking this because i know while writing data to target write doen't happen properly so read fails with same error. please help to debug it further, Thanks in advance.

  • Anonymous
    March 02, 2015
    This is one of the best post i seen to understand how sqliosim works.