Troubleshooting Slow Disk I/O in SQL Server
If you suspect you are experiencing poor disk performance you can use internal DMVs combined with a Performance Monitor collection to get a good picture of the health of the disk I/O subsystem and any latency SQL Server is experiencing from its poor performance.
Performance Monitor Counters:
Below is a list of counters and thresholds to watch for when monitoring IO on your SQL Server:
Edit 7/19/2011 - Please monitor the below counters and compare average values over time to the thresholds provided. The counters are intended for OLTP workloads and larger workloads such as OLAP and batch activity may not fit within these thresholds.
Object |
Counter |
Instance |
Threshold |
Comments |
Physical Disk / Logical Disk |
Avg. Disk sec/Read |
* |
( < .005 excellent; .005 - .010 Good; .010 - .015 Fair; > .015 investigate) |
Measures read latency on the disks |
Physical Disk / Logical Disk |
Avg. Disk sec/Write |
* |
( < .005 excellent; .005 - .010 Good; .010 - .015 Fair; > .015 investigate) |
Measures write latency on the disks |
Physical Disk / Logical Disk |
Avg. Disk sec/Transfer |
* |
( < .005 excellent; .005 - .010 Good; .010 - .015 Fair; > .015 investigate) |
Measures average latency for read or write operations |
Physical Disk / Logical Disk |
% Idle Time |
* |
> 50% |
The disk should not be working constantly |
Physical Disk / Logical Disk |
Avg. Disk Queue Length |
* |
None |
Be careful when using old thresholds of 2 with SQL Server and SANs. This value can be much higher than 2. Do not rely solely on this counter to diagnose an IO problem. |
Physical Disk / Logical Disk |
Disk reads/sec |
* |
None |
Number of read IOPs |
Physical Disk / Logical Disk |
Disk writes/sec |
* |
None |
Number of write IOPs |
Physical Disk / Logical Disk |
Disk Transfers/sec |
* |
None |
Number of read and write IOPs (used to compare against capacity of storage subsystem) |
Physical Disk / Logical Disk |
Disk Read Bytes/sec |
* |
None |
Used for determine bandwidth utilization for Read Ops |
Physical Disk / Logical Disk |
Disk Write Bytes/sec |
* |
None |
Used to determine bandwidth utilization for Write Ops |
Physical Disk / Logical Disk |
Disk Bytes/sec |
* |
None |
Used to determine bandwidth total bandwidth utilization |
Processor |
% Processor Time |
_Total |
< 80% |
Amount of total CPU usage across all processors |
Processor |
% User Time |
_Total |
< 80% |
Amount of total CPU usage in user mode across all processors |
Processor |
% Privileged Time |
_Total |
< 30% |
Amount of total CPU usage in kernel mode across all processors |
Memory |
Available Mbytes |
n/a |
> 100MB |
Available Physical RAM on the box |
SQL Server:Buffer Manager |
Buffer Cache Hit Ratio |
n/a |
> 98% |
Indication of how often SQL Server is finding data pages in the buffer pool |
SQL Server:Buffer Manager |
Page Life Expectancy |
n/a |
> 300 |
Indicates pressure on memory as data pages are flushed to disk – inducing IO |
SQL Server:Buffer Manager |
Lazy Writes/sec |
n/a |
< 20 |
Measure of when the lazy writer thread must clear up buffers to free space for pages – inducing IO |
SQL Server:Buffer Manager |
Free Pages |
n/a |
> 640 |
Measure of how many free pages are in the buffer pool for use |
SQL Server:Wait Statistics |
Page IO latch waits |
Average wait time (ms) |
None |
Average wait time spent waiting on Page IO between disk and memory |
Additionally, by storing scalar values returned from any of the queries in this post into a variable and executing those queries in a loop or a job, you can then make them available to Performance Monitor as a counter to plot them with other metrics. This allows you to get more granular detailed information and track IO waits and other data on a particular database, application, or session if required.
If you can quantify the information from a query as a single numeric value, then you can pass it out to Performance Monitor to enhance your analysis. For instance, you may choose to find out how many outstanding IOPs exist on SQL Server waiting longer than 10ms with:
select COUNT(*) from sys.dm_io_pending_io_requests where io_type = 'disk' and io_pending_ms_ticks > 10
Or you could map the time spent waiting on PAGEIOLATCH_% with:
select SUM(wait_time_ms) from sys.dm_os_wait_stats where wait_type like 'PAGEIOLATCH%'
For more information on setting up a custom performance counter, check here:
Creating a custom performance monitor counter for SQL Server
SQL Server Dynamic Management Views (DMVs):
The following DMVs will return extensive information that will help determine and diagnose an IO problem as seen from within SQL Server:
To check for outstanding IOPs from within SQL Server waiting longer than 10ms, start with:
select * from sys.dm_io_pending_io_requests where io_type = 'disk' and io_pending_ms_ticks > 10
To get information on IO stalls (a wait for IO that can’t be completed immediately), and the amount of traffic per file in a database, use:
select * from sys.dm_io_virtual_file_stats(db_id(‘<ENTER YOUR DATABASE NAME HERE>’), NULL)
To see if IO is the primary waiter for SQL Server, execute the following and look for wait types with a naming convention like PAGEIOLATCH_%%:
select top 10 * from sys.dm_os_wait_stats order by wait_time_ms desc
To monitor PAGEIOLATCHes for increases in time, periodically check and store these:
select * from sys.dm_os_wait_stats where wait_type like 'PAGEIOLATCH%'
To capture current executing statements that are waiting on IO or just finished waiting on IO:
select
r.session_id,
s.login_name,
s.program_name,
r.start_time,
r.status,
r.command,
r.wait_type,
r.wait_time,
r.last_wait_type,
r.logical_reads,
(r.logical_reads * 8192) as 'KB Read',
r.writes,
(r.writes * 8192) as 'KB Written',
t.[text]
from sys.dm_exec_requests r
cross apply sys.dm_exec_sql_text(sql_handle) t
inner join sys.dm_exec_sessions s
on r.session_id = s.session_id
where s.is_user_process = 1 and
(r.wait_type like 'PAGEIOLATCH%' or r.last_wait_type like
'PAGEIOLATCH%') and
r.session_id != @@SPID
Finally, suspect I/O capacity or bandwidth can be verified with SQLIO and the I/O integrity checked with SQLIOSim:
SQLIO Disk Subsystem Benchmark Tool (SQLIO is a tool provided by Microsoft which can also be used to determine the I/O capacity of a given configuration.)
How to use the SQLIOSim utility to simulate SQL Server activity on a disk subsystem
https://support.microsoft.com/kb/231619
- Jay
Comments
Anonymous
July 16, 2011
Please don't perpetuate the old PLE > 300 rule of thumb. PLE needs to be monitored over time as it's an instantaneous calculation rather than an average. A PLE of, say 360, which by the old 300 threshold definition is fine, would mean the entire buffer pool is being flushed every 6 minutes. With a 100GB buffer pool, is that acceptable? No. PLE needs to be monitored for when it dips below normal value (which will vary for everyone) and stays low.ThanksAnonymous
July 19, 2011
Paul makes a good point here. In my experience, 300 has proven to be a good value - with 2 caveats: First, my experience is with measuring this over a period of time with Performance Monitor (though the post doesn't do a good job of explicitly explaining this). Do not take a single data point of this value dropping below 300 or hovering slighty above 300 to be a problem (or not a problem) - rather collect data over time and analyze the averages. Performance Monitor will show you an average if you have collected these counters over time. For that matter, I wouldn't take any data point here (even the ones that ARE natively average values) as an indication of a problem if I only see one or two datapoints exceeding an expected threshold value. Second, this value (and others in this post) are for OLTP workloads. If you are are doing large batch operations on massive amounts of data - as in a datawarehouse/DSS/OLAP environment - then the above numbers should not be strictly adhered to for collections taken during those workloads. The IDEAL method for every customer is to create a baseline and compare against it. However, the complexity or time involved with Enterprise customers creating baselines across hundreds of SQL instances means it doesn't happen. When a problem occurs, they often need a "rule of thumb" or some measure to gauge where the problem lies and some threshold for monitoring software to trigger alerts. As pointed out by Paul, when setting these up sample them over a period of time to make sure that the values are consistently low. Many times the must be considered in the context of other counters in the same object as well. Your best method is a baseline of the activity of the server when everything was running optimal. Absent that, the counter thresholds have served me well - when viewed for a duration and taken in the right context for general OLTP workloads. Though it is true, if you take a baseline, you may find that anything below 1000 on PLE is unacceptable performance for you. As usual, these rule-of-thumbs are a good starting point but can't replace good data and an understanding of your workload.Anonymous
June 27, 2012
QuestionsOn what type of storage the values for Avg. Disk Sec /Read tested on?Will the threshold values mentioned be different with tier 2 SAN storage - Solid state disks / SAS disks? if Yes are those documented some where? Will the threshold values mentioned above be applicable for all types of storage? ThanksAnonymous
June 28, 2012
These are commonly accepted thresholds that I (and many others) use when evaluating disk I/O latency. Thresholds like these and others are set based on what is considered to be "acceptable" performance levels for the given resource. I don't use different thresholds for different disks. Yes, "tier 2" SAN storage is likely to be slower than "tier 1". Data you place on tier 2 or tier 3 storage should be data used in situations where you can tolerate a higher degree of latency (i.e. backup). Here I am talking about performance metrics for OLTP workloads so that disk I/O is not considered to be a bottleneck. So you can apply these to all types of storage where low latency and performance is a primary consideration - but you should expect SATA to be slower and SSD to be faster.Anonymous
August 06, 2012
Thank you - perfect info for what I needed