Sdílet prostřednictvím


Are these IO statistics correct?

After a long time, I had the opportunity to meet again my colleague and friend Eladio, who is a well recognized Mentor working at Solid Quality Mentors. He was delivering some training sessions to Microsoft customers in our premises today. In one of his session breaks, we had a brief chat about life, work, family, sports and, of course, SQL Server issues he’s been facing recently. :-)

So, one of those "challenges" was that he wanted me to investigate a bit on some weird behavior he was seeing with the way pages were being retrieved from disk by the buffer pool. Or more, appropriately, the way it was being counted in the IO statistics output.

The steps to reproduce the behavior are described in his post Acabando con los mitos: Memoria usada por bases de datos (Database Snapshot). It’s an introductory sample about Database Snapshots.

The numbers I’ve seen when running his repro in my working machine are a bit different, but still shows that when querying the table through the snapshot database it counts no "physical reads", but all pages are read from disk through the read ahead functionality implemented by the buffer pool, and therefore counted as "read ahead" pages in the IO statistics.

Now, let me explain what SQL Server storage engine does to get the rows one way or another.

In order to retrieve the rows from the table through the Snapshot database, the storage engine uses an unordered scan on the index (::GetScanTypeForIndex), because the following conditions are met:

  1. The leaf level page count exceeds 64 pages, which is a hardcoded arbitrary cost threshold. Since the cost of initiating the unordered scan is relatively high, we only want to perform those type of scans on big enough indexes (> 64 pages on the leaf level) where we expect to benefit from this scan type.
  2. We are not interested in retrieving the rows in index’s key-order.
  3. The computed locking granularity is NOLOCK (ROWSET locking would also be a valid granularity here), and this is the case because the recovery unit for the dataset is read only (::ComputeLockingGranularity), as in every snapshot database.
  4. The dataset must not support direct seeks to rows.
  5. The dataset will be accessed for fetching only.

While, when the SELECT is run against the table in the original database, it is not meeting the third condition above. The resulting computed locking granularity (::ComputeLockingGranularityLong) is ROW. In this situation, the storage engine uses a key ordered scan instead.

A dataset of "unordered scan" type, can prefetch (read ahead) all the leaf level pages which make up the index. This is the reason why you don’t see physical reads, because all those pages retrieved from disk are counted on the "read ahead" statistics.

On the other hand, the "key order scan" type, as its name implies, must scan the index in key order, and that causes that some pages are read from disk one at a time (those counted under "physical reads"), and then the read of the leaf level pages is also prefetched but, potentially and most probably, in a different order than the one used to satisfy the "unordered scan".

One thing to take into consideration here is that there wouldn’t be that difference in behavior if, for example:

  1. the index would fit all its leaf level pages in no more than 64, or
  2. the function which computes the locking strategy for the dataset would have decided that the cheapest locking cost was ROWSET instead of ROW, or
  3. you would have specified a NOLOCK hint in your query.

Eladio, I hope this clarifies your doubts and removes any concerns you had about what you have seen in your example.

In the near future I'll post what was the answer to the other question Eladio asked me today, which had to do with SQL Trace server side traces and the points when it decides it must flush and persist, any events produced until that moment, to the trace file on disk. And more specifically, if it was different for SQL Server 2000 64-bit for IA64 than for SQL Server 2000 32-bit for x86/x64.

Time to go to bed for me now.

Keep in touch!

Comments