How To: Troubleshooting SQL Server I/O bottlenecks

[アーティクル]
06/03/2013

One of the most common reason for server performance issues with respect to SQL Server is the presence of an I/O bottleneck on the system. When I say I/O bottleneck, it can mean issues like slow disks, other processes hogging I/O, out-dated drivers, etc. In this blog, I will seek to outline the approach for identifying and troubleshooting I/O bottlenecks on SQL Server.

The Symptoms

The following are the most common symptoms of an I/O bottleneck on the SQL Server machine:

You see a lot of threads waiting on one or more of the following waits:
- PAGEIOLATCH_*
- WRITELOG
- TRACEWRITE
- SQLTRACE_FILE_WRITE_IO_COMPLETION
- ASYNC_IO_COMPLETION
- IO_COMPLETION
- LOGBUFFER
You see the famous "I/O taking longer than 15 seconds" messages in the SQL Server errorlogs:
2012-11-11 00:21:25.26 spid1 SQL Server has encountered 192 occurrence(s) of IO requests taking longer than 15 seconds to complete on file [E:SEDATAstressdb5.ndf] in database [stressdb] (7). The OS file handle is 0x00000000000074D4. The offset of the latest long IO is:0x00000000022000”.

Troubleshooting

Data Collection:

If you see the symptoms outlined above quite frequently on your SQL Server installation, then it will be safe to draw the conclusion that your instance is suffering from a disk subsystem or I/O bottleneck. Let's look at the data collection and troubleshooting approach pertaining to the same:

Enable a custom Performance Monitor collector to capture all disk related counters. Just go to start->run, type perfmon, and hit ok. Next, go to Data Collector sets->User Defined, right click on User Defined, and click New-> Data Collector set.
Note: The best thing about perfmon(apart from the fact that it is built into windows) is that it's a very lightweight diagnostic, and has negligible performance overhead/impact.
Give the data collector set a name, and select Create manually. Under type of data, select the "Create data logs" option, and check the Performance Counter checkbox under it.
Next, click on add performance counters, and select the "LogicalDisk", "Process" and "PhysicalDisk" groups, and select "All instances" for both before adding them.
After you have added the counters, you can also modify the sample interval. You might want to do this if you see spikes lasting less than 15 seconds, which is the default sample interval. I sometimes use an interval of 5 seconds when I want to closely monitor an environment .
Click on Finish and you will now see the new Data Collector set created under User Defined.
Next, right click on the Data Collector set you just created, and click start.

I normally recommend my clients to run the perfmon collector set for at least one business day, so that it has captured data for the load exerted by at least one standard business cycle.

Analysis:

Now that we have the data, we can start the analysis. After stopping the collector set, you can open the blg file generated (the path is displayed under the output column, on the right hand side in perfmon) using perfmon (a simple double click works, as the file type is associated with perfmon by default). Once open, is should have automatically loaded all the counters. Analysing with all the counters can be a bit cumbersome, so I would suggest that you first delete all the counters and then add specific counters one by one.

I will list out the important counters here, along with their expected values:

Process->IO Data Bytes/sec: This counter represents the average amount of IO Data bytes/sec spawned by each process. In conjunction with IO Other Bytes/sec, this counter can be used to determine the average IO per second as well as the total amount of IO spawned by each process during the capture. Check for the largest I/O consumers, and see if SQL is being starved of I/O due to some other process spawning a large amount of I/O on the system.
Process-> IO Other Bytes/sec: This counter represents the non-data IO spawned by each process during the capture. Usually, the amount of non-data IO is very low as compared to data IO. Use the total of both IO Data Bytes and IO other bytes to determine the total amount of IO spawned by each process during the capture. Check for the largest I/O consumers, and see if SQL is being starved of I/O due to some other process spawning a large amount of I/O on the system.
Physical Disk/Logical Disk->Avg. Disk Sec/Read: This counter signifies the average amount of time, in ms, that it takes for a read I/O request to be serviced for each physical/logical disk. An average of less than 10 ms (0.010) is good, and between 10-15 ms (0.010-0.015) is acceptable, but anything beyond 15 ms (0.015) is a cause for concern.
Physical Disk/Logical Disk->Avg. Disk Sec/Write: This counter signifies the average amount of time, in ms, that it takes for a write I/O request to be serviced for each physical/logical disk. An average of less than 10 ms (0.010) is good, and between 10-15 ms (0.010-0.015) is acceptable, but anything beyond 15 ms (0.015) is a cause for concern.
Physical Disk/Logical Disk->Disk Bytes/Sec: This counter represents, in bytes, the throughput of your I/O subsystem for each physical/logical disk. Look for the max value for each disk, and divide it by 1024 twice to get the max throughput in MB for the disk. SAN's generally start from 200-250 MB/s these days. If you see that the throughput is lower than the specifications for the disk, it's not necessarily a cause for concern. Check this counter in conjunction with the Avg Disk Sec/Read or Avg Disk Sec/Write counters (depending on the wait/symptom you see in SQL), and see the latency at the time of the maximum throughput. If the latency is green, then it just means that SQL spawned I/O that was less the disk throughput capacity, and was easily handled by the disk.
Physical Disk/Logical Disk->Avg. Disk Queue Length: This counter represents the average number of I/O's pending in the I/O queue for each physical/logical disk. Generally, if the average is greater than 2 per spindle, it's a cause for concern. Please note that I mentioned the acceptable threshold as 2 per spindle. Most SAN’s these days have multiple spindles. So, for example, if your SAN has 4 spindles, the acceptable threshold for Avg Disk Queue Length would be 8.
Check the other counters to confirm.
Physical Disk/Logical Disk->Split IO/Sec: This counter indicates the I/O's for which the Operating System had to make more than one command call, grouped by physical/logical disk. This happens if the IO request touches data on non-contiguous file segments. It's a good indicator of file/volume fragmentation.
Physical Disk/Logical Disk->%Disk Time: This counter is a general mark of how busy the physical/logical disk is. Actually, it is nothing more than the “Avg. Disk Queue Length” counter multiplied by 100. It is the same value displayed in a different scale. This is the reason you can see the %Disk Time going greater than 100, as explained in the KB https://support.microsoft.com/kb/310067. It basically means that the Avg. Disk Queue Length was greater than 1 during that time. If you've captured the perfmon for a long period (a few hours or a complete business day), and you see the %Disk Time to be greater than 80%, it's generally indicative of a disk bottleneck, and you should take a closer look at the other counters to arrive at a logical conclusion.

It's important to keep 2 things in mind. One, make sure your data capture is not skewed or biased in any way (for example, do not run a capture at the time of a monthly data load or something). Second, make sure you correlate the numbers reflected across the various counters to arrive at the overall picture of how your disks are doing.

Most of the time, I see that people are surprised when they are told that there are I/O issues on the system. Their typical response is "But, it's been working just fine for x years, how can it create a bottleneck now?". The answer lies within the question itself. When the server was initially configured, the disk resources were sufficient for the load on the server. However, with time, it's inevitable that the business grows as a whole, and so do the number of transactions, as well as the overall load. As a result, there comes a day when the load breaches that threshold, and the disk resources on the server are no longer sufficient to handle it. If you come to office one fine day, see high latency on the disks during normal working hours, and are sure that

No special/additional workloads are running on SQL
No other process on the server is spawning excessive I/O,
Nothing changed on the server in the past 24 hours (like a software installation, patching, reboot, etc.)
All the BIOS and disk drivers on the server are up to date,

Then it's highly likely that the load on your server has breached this threshold, and you should think about asking your disk vendor(s) for a disk upgrade (after having them check the existing system once for latency and throughput, of course). Another potential root cause that can cause high latency is that your disk drivers and/or BIOS are out of date. I would strongly recommend checking periodically for updates to all the drivers on the machine, as well as the BIOS.

Hope this helps. As always, comments, feedbacks and suggestions are welcome.

Comments

Anonymous
June 03, 2013
Good going man!! nice post!!
Anonymous
June 04, 2013
Thanks Harsh...!!!
Anonymous
June 04, 2013
nice post . Very informative
Anonymous
June 04, 2013
Thanks for appreciating Samson...!!!
Anonymous
August 12, 2013
Good one....
Anonymous
August 12, 2013
Thanks Karan...!!!
Anonymous
September 24, 2013
Very Nice Post..
Anonymous
September 25, 2013
Thanks for appreciating Swaroop...!!!!
Anonymous
February 05, 2014
Good One. Easy to understand
Anonymous
February 06, 2014
Thanks a lot Ishikha for your kind words...!!!
Anonymous
February 25, 2014
Beautiful - very useful and easy to understand!
Anonymous
February 26, 2014
Thanks a lot Sylvia...!!!!
Anonymous
February 27, 2014
I am trying to determine if a physical SQL cluster is necessary or a pair of VM SQL server with mirror will be sufficient for a new deployment. After some calculation, the expected load will be about 300KB/second in average. In peek time it may be double or triple. Any suggestion is appreciated.
Anonymous
April 01, 2014
outstanding! i am learning SQL Server and this really helps
Anonymous
April 17, 2014
Very good article. I/O bottle neck may reduce after update statistics and rebuild index. How to update statistics and rebuild index for database size 2TB and order of doing the activity
Anonymous
August 14, 2014
The comment has been removed
Anonymous
August 18, 2014
Hi Maggie...unfortunately, the answer is it depends. It depends on the hardware capacity, the I/O patterns and trends, and on the application code. I would strongly recommend that you test whatever solution you use before deploying it in production.
Anonymous
August 18, 2014
Thanks Oracleman...happy to know you found the blog helpful.
Anonymous
August 18, 2014
The comment has been removed
Anonymous
August 18, 2014
Hi Ansar...thanks for appreciating. A disk bottleneck on any disk on the system can impact the overall system performance due to a cascading effect, but it should not cause the response time/disk latency for the other disk(s) to go up. Hope this helps.
Anonymous
October 01, 2014
Hi HarshDeep_Singh I currently have a problem of Application hang. My application won't display its UI when i click the menu in just displaying white screen (like endless looping). This problem usually solved after restarting the sql service, but happening again next day. I've checked all the deadlocked process and its fine. I wonder if there is caused by bottleneck effect
Anonymous
October 20, 2014
Hi Andre...thanks for reaching out. It could be due to bottleneck, but could also be due to other factors such as contention for resources at the OS level, leaked (orphaned) transactions holding locks, etc. it would require detailed troubleshooting from both the app side and SQL side to nail down the exact cause.
Anonymous
November 18, 2014
Hi Harshdeep, With help of your post i tried troubleshooting the "SQL Server has encountered 2 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file" issue and got Avg. Disk Sec/Write as .080 and more at random times. Then i checked the space issues on the Drive where i have enabled database logging, it was 20% free; CPU and memory is also good still i am getting same errors of I/O taking longer than 15 seconds. After researching more on this i got to know that: On computers with multiple CPUs, the CPUs are designed to “sleep” during periods of low workload. When CPUs sleep, SQL may not accurately determine CPUs overall workload and incorrectly report this as IO WARNING shown above. I need to confirm from you if my server power option is set to " Turn Off Display to 10 Minutes" can this lead to this I/O issues. Thanks Brijesh
Anonymous
December 08, 2014
Hi Brijesh, Thanks for expressing interest in the blog. I believe you need to run perfmon for a bit more, and check the avg disk sec/read and avg disk sec/transfer counters as well. Writes are generally faster than reads on most SAN's, though the exact details may vary. You should also check disk bytes/sec to see the amount of throughput you're getting. I believe you are facing an I/O bottleneck, and disk free space and power options should not have anything to do with it. Hope this helps.
Anonymous
August 16, 2015
Hi HarshDeep Very good explanation and approach. Thanks for this concise blog.
Anonymous
August 23, 2015
may i know what are the scale that i should use for average sec disk/write and average disk queue length.. also one more question, how do i know how many spindle my server/vm have..?thanks in advance.
Anonymous
September 03, 2015
We are observing Avg disk read / sec ( random spikes > 200 ms), Avg disk write / sec ( random spikes > 500 ms) and Disk Transfer / sec ( random spikes > 200) during ETL jobs bulk migration. Please tell me if Disk I/O is performance bottleneck in this case?

次の方法で共有

How To: Troubleshooting SQL Server I/O bottlenecks

Comments

その他のリソース