PerfGuide: Analyzing Poor Disk Response Times

If you have arrived here, then you have identified a potential disk performance condition where the “\LogicalDisk(*)\Avg. Disk sec/Read” and/or “\LogicalDisk(*)\Avg. Disk sec/Write” performance counters average more than 15ms (0.015 seconds) on your Windows computer. If this is not correct, then return to the Start of the Performance Guide.

Commonly Used Disk Terminology

Spindle: A spindle of a hard disk. Jargon for a physical hard disk.

LUN: A LUN (logical unit number) is the identifier of a device which is being addressed by the SCSI protocol or similar protocols such as Fibre Channel and iSCSI. This is jargon for the representation of a physical disk to the Windows operating system, meaning Windows thinks it is a single spindle, but it could be many spindles masked by the hardware.

PhysicalDisk: The PhysicalDisk counter object is the physical disk representation to Windows. It is the same as a LUN from the operating system perspective.

LogicalDisk: A drive letter or mount point mapped to the representation of a physical disk from an operating system perspective.

Disk Transfers: This is the total number of Windows read and write operations on a disk. This is the number of I/O’s that Windows is doing, but could translate to one or more I/O’s on the physical hardware.

Disk Virtualization

“\PhysicalDisk(*)\Avg. Disk Queue Length” and “\PhysicalDisk(*)\% Disk Time” are great counters to use as initial indicators of disk performance when you know how many physical spindles are behind the LUN. In the enterprise server environment, Storage Area Networks (SAN) are common and many virtualize the disk spindles so much that no one really knows how many spindles are behind a given LUN. Therefore, Avg. Disk Queue Length and % Disk Time become unreliable when measuring the performance of hardware RAID and SAN storage.

Since disk virtualization can make it difficult or impossible to know how many spindles are behind a given LUN, and since performance is ultimately the perception of the end user, our best indicator of a disk performance problem is disk response times. The performance counters that measure disk response times are “\LogicalDisk(*)\Avg. Disk sec/Read” and “\LogicalDisk(*)\Avg. Disk sec/Write”. A standard uncached 5400 RPM disk drive can do about 200 IOPS (I/O’s per second) with about 5ms on average per 4K I/O with 100% random writes. Therefore, if disk response times are greater than 3 times that amount (15ms or 0.015 seconds), it warrants further investigation to see why the disk is not performing as well as expected.
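The arithmetic behind the 15ms figure can be sketched as follows. This is only an illustration of the numbers quoted above; it assumes one I/O in flight at a time, and the names `ms_per_io`, `BASELINE_IOPS`, and `INVESTIGATE_MULTIPLIER` are made up for this example.

```python
def ms_per_io(iops: float) -> float:
    """Average milliseconds per I/O for a disk sustaining `iops` operations
    per second, assuming a single outstanding I/O at a time."""
    return 1000.0 / iops


BASELINE_IOPS = 200         # figure quoted in the text for an uncached 5400 RPM drive
INVESTIGATE_MULTIPLIER = 3  # "3 times that amount"

baseline_ms = ms_per_io(BASELINE_IOPS)
threshold_ms = baseline_ms * INVESTIGATE_MULTIPLIER

print(f"baseline: {baseline_ms:.0f} ms per I/O")
print(f"investigate above: {threshold_ms:.0f} ms ({threshold_ms / 1000:.3f} seconds)")
```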

Troubleshooting

The PAL tool checks for and raises warning alerts for response times greater than 15ms and critical alerts for response times greater than 25ms.

Response times measured by “\LogicalDisk(*)\Avg. Disk sec/Read” and “\LogicalDisk(*)\Avg. Disk sec/Write” are our most authoritative and primary indicator of disk performance. With that said, they can’t tell us the entire story. When you see high disk response times, it simply warrants more investigation. For example, slow disk response times could be normal if the average I/O size is relatively large, such as 2MB per I/O.
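To see why a large I/O size alone can push response times past 15ms, consider a rough estimate of the pure transfer time. The 100MB per second throughput below is a hypothetical figure chosen only for illustration, not a measurement, and the sketch ignores seek time and queuing.

```python
MB = 1024 * 1024

# Hypothetical sustained throughput of the disk; an assumption for this example.
ASSUMED_THROUGHPUT_BYTES_PER_SEC = 100 * MB


def transfer_time_ms(io_bytes: int, throughput_bytes_per_sec: float) -> float:
    """Milliseconds needed just to move `io_bytes` at the given throughput."""
    return io_bytes / throughput_bytes_per_sec * 1000.0


small_io_ms = transfer_time_ms(4 * 1024, ASSUMED_THROUGHPUT_BYTES_PER_SEC)
large_io_ms = transfer_time_ms(2 * MB, ASSUMED_THROUGHPUT_BYTES_PER_SEC)

print(f"4K I/O:  {small_io_ms:.3f} ms of transfer time")
print(f"2MB I/O: {large_io_ms:.1f} ms of transfer time")
```

Under that assumption, a 2MB transfer takes about 20ms before any seek time is counted, so a response time above 15ms would not by itself indicate a problem.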

Initial Indicators

Disk Read and Write response times: Monitor the counters “\LogicalDisk(*)\Avg. Disk sec/Read” and “\LogicalDisk(*)\Avg. Disk sec/Write” for disk response times on average greater than 15ms (0.015 seconds). Infrequent spikes above 25ms (0.025 seconds) are normal.
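A minimal sketch of this check, applying the 15ms warning and 25ms critical thresholds to the average of a set of samples rather than to individual spikes. The function name and the sample values are made up for illustration; this is not how the PAL tool itself is implemented.

```python
from statistics import mean

WARNING_SEC = 0.015   # average response time that warrants investigation
CRITICAL_SEC = 0.025  # average response time treated as critical


def classify_response_time(samples_sec: list[float]) -> str:
    """Classify a series of Avg. Disk sec/Read or sec/Write samples by their mean."""
    average = mean(samples_sec)
    if average > CRITICAL_SEC:
        return "critical"
    if average > WARNING_SEC:
        return "warning"
    return "ok"


# One 30ms spike among otherwise fast samples does not raise an alert.
print(classify_response_time([0.004, 0.006, 0.030, 0.005]))
# Sustained slow responses do.
print(classify_response_time([0.020, 0.018, 0.022]))
print(classify_response_time([0.030, 0.040]))
```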

Symptoms

If a disk has poor response times, then check the following:

  1. Monitor IOPS (I/O’s per second): Monitor the counter “\LogicalDisk(*)\Disk Transfers/sec” when disk response times are high. Generally, a single 5400RPM drive should be able to do more than 80 disk transfers per second. Therefore, if disk transfers are lower than 80 when response times are high (greater than 15ms), then the disk is performing slower than a single 5400RPM spindle. Use this information to help with diagnosis.

  2. Large Bytes Per I/O: The size of the disk transfers can have an impact on disk response times. Use the counters “\LogicalDisk(*)\Avg. Disk Bytes/Read” and “\LogicalDisk(*)\Avg. Disk Bytes/Write” to measure the size of the disk transfers. If the sizes are greater than 64K on average, then longer than normal response times are to be expected.

    1. Consider Larger Blocksizes: To potentially optimize this condition, consider formatting the logical disk at 64K or at the most common disk transfer size if possible. The default blocksize is 4K. *Warning:* Adjusting the blocksize of a logical disk requires reformatting which will erase all data on the logical disk.
      1. Use the following command to check the blocksize of the logical disks:
        wmic volume GET Caption, BlockSize

        Sample Output:
        BlockSize  Caption
        4096       \\?\Volume{bdf048e7-33e6-11df-b5ae-806e6f6e6963}\
        4096       D:\
        4096       C:\

  3. Disk Fragmentation: If this is spinning media, then disk fragmentation can cause poor disk performance. Consider running a disk fragmentation analysis on the affected disks. If this is non-spinning media such as solid state disks (SSD), then disk defragmentation will not help. See the following documents for more information:

    1. Disk Fragmentation and System Performance
  4. Low Free Disk Space: Monitor the counter “\LogicalDisk(*)\% Free Space”. Check if the value is less than 10 (percent). Spinning media commonly writes to the outer sections of the spindle first and then writes towards the center hub. When free disk space is low, the head will spend more time waiting on the slower hub of the spindle. If free disk space is low, then:

    1. Delete any unnecessary files from the disk.
    2. Move data that can be safely moved to another location.
    3. Add disk capacity to the server.
  5. Failing Hardware: Disk failures are typically progressive, with the disk degrading over time. Check the System Event Logs for disk read/write failures and troubleshoot the hardware.

  6. Poorly Performing Drivers and Anti-x Software: Poor performance could be the result of poorly written disk drivers and/or anti-x (virus, intrusion, etc.) drivers doing unnecessary scans. Consider using Process Monitor (a Mark Russinovich SysInternals tool owned by Microsoft) and/or Microsoft xPerf (part of the Windows Server 2008 Performance Toolkit) to capture a sample of disk activity when the symptoms persist.

  7. Identify Processes and Files Using Disk Resources: Use Process Monitor (a Mark Russinovich SysInternals tool owned by Microsoft), Microsoft xPerf (part of the Windows Server 2008 Performance Toolkit), or Resource Monitor (part of Windows 7) to identify the processes and files most frequently using disk resources. This helps identify unnecessary disk usage.

  8. Misconfigured Disk Cache: Monitor the affected disk over time and determine the read/write ratio of the disks using the counters “\LogicalDisk(*)\Disk Reads/sec” and “\LogicalDisk(*)\Disk Writes/sec”. Adjust the hardware disk cache (if applicable) to match the read/write ratio.

  9. Oversaturated Hardware: Hardware between the Windows driver and the physical spindle might be saturated, such as the fibre channel hardware (cable, switches, etc.) between the Host Bus Adapter (HBA) and the SAN. The SAN itself might also be misconfigured or saturated.

    1. Check the Hardware: Use hardware vendor provided tools to check the SAN fabric and hardware between the HBA and SAN.
    2. Dedicate and/or Allocate More Spindles: Consult with your hardware vendor on the possibility of dedicating or allocating more physical spindles to the affected LUNs.
    3. Change the RAID Type: Monitor the counters “\LogicalDisk(*)\Disk Reads/sec” and “\LogicalDisk(*)\Disk Writes/sec” and compare them to create a ratio of reads to writes. If most of the disk I/O is write operations, then avoid RAID types that multiply the number of I/O’s required to do a write operation such as RAID5 (4 I/O’s per write operation) and RAID6 (6 I/O’s per write operation).
  10. Partition Misalignment: Misaligned disk partitions can cause unnecessary disk I/O at the hardware level for disks. Use the following command line (works on Windows XP/2003 and greater) to check the starting offset of the disk partitions:
    wmic partition GET Caption, StartingOffset

    Sample Output:
    Caption                StartingOffset
    Disk #0, Partition #0  1048576
    Disk #0, Partition #1  105906176
    Disk #0, Partition #2  21475885056

    A StartingOffset (in bytes) that is not a whole multiple of 1,048,576 (1MB) is likely misaligned. In the sample output above, all three offsets are whole multiples of 1MB. Partitions created on Windows Vista/2008 and greater will automatically use a 1MB starting offset.

    See the following document(s) for more information:

    1. Disk Partition Alignment Best Practices for SQL Server
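The numeric checks from the symptom list above (items 1, 2, 4, and the RAID write multipliers in item 9) can be sketched as a small triage helper. This is a sketch under stated assumptions: the `DiskSample` container, the function names, and the sample values are hypothetical, counter values are assumed to have been collected already, and each read is assumed to cost one back-end I/O.

```python
from dataclasses import dataclass


@dataclass
class DiskSample:
    """Averaged LogicalDisk counter values for one disk (hypothetical container)."""
    avg_sec_per_transfer: float    # worse of Avg. Disk sec/Read and sec/Write
    transfers_per_sec: float       # Disk Transfers/sec
    avg_bytes_per_transfer: float  # Avg. Disk Bytes/Read or Bytes/Write
    free_space_pct: float          # % Free Space


SLOW_SEC = 0.015            # response time that warrants investigation
MIN_IOPS = 80               # a single 5400RPM drive should exceed this
LARGE_IO_BYTES = 64 * 1024  # above 64K, longer response times are expected
LOW_FREE_PCT = 10           # % Free Space below this is considered low


def symptom_findings(sample: DiskSample) -> list[str]:
    """Return the symptoms that apply when response times are high."""
    findings: list[str] = []
    if sample.avg_sec_per_transfer <= SLOW_SEC:
        return findings  # response times are fine; nothing to diagnose
    if sample.transfers_per_sec < MIN_IOPS:
        findings.append("slower than a single 5400RPM spindle")
    if sample.avg_bytes_per_transfer > LARGE_IO_BYTES:
        findings.append("large I/O size, longer response times expected")
    if sample.free_space_pct < LOW_FREE_PCT:
        findings.append("low free disk space")
    return findings


# I/O's per write operation, as quoted in the text.
RAID_WRITE_IOS = {"RAID5": 4, "RAID6": 6}


def backend_iops(reads_per_sec: float, writes_per_sec: float, raid_type: str) -> float:
    """Estimate hardware I/O's per second, assuming one I/O per read."""
    return reads_per_sec + writes_per_sec * RAID_WRITE_IOS[raid_type]


slow_disk = DiskSample(0.030, 50, 128 * 1024, 5)
for finding in symptom_findings(slow_disk):
    print(finding)

# A write-heavy workload: 100 reads/sec and 300 writes/sec.
print(backend_iops(100, 300, "RAID5"))
print(backend_iops(100, 300, "RAID6"))
```

With a write-heavy ratio like the one in the last two lines, the same Windows workload generates noticeably more hardware I/O on RAID6 than on RAID5, which is the reasoning behind avoiding those RAID types for mostly-write disks.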