

Understanding Storage Timeouts and Event 129 Errors

Greetings fellow debuggers, today I will be blogging about Event ID 129 messages.  These warning events are logged to the system event log with the storage adapter (HBA) driver’s name as the source.  Windows’ STORPORT.SYS driver logs this message when it detects that a request has timed out.  The HBA driver’s name is used as the event source because it is the miniport associated with storport.

 

Below is an example 129 event:

 

Event Type:       Warning
Event Source:     <HBA_Name>
Event Category:   None
Event ID:         129
Date:             4/9/2009
Time:             1:15:45 AM
User:             N/A
Computer:         <Computer_Name>
Description:
Reset to device, \Device\RaidPort1, was issued.

 

So what does this mean?  Let’s discuss the Windows I/O stack architecture to answer this.

 

Windows I/O uses a layered architecture where device drivers are on a “device stack.”  In a basic model, the top of the stack is the file system.  Next is the volume manager, followed by the disk driver.  At the bottom of the device stack are the port and miniport drivers.  When an I/O request reaches the file system, it takes the block number of the file and translates it to an offset in the volume.  The volume manager then translates the volume offset to a block number on the disk and passes the request to the disk driver.  When the request reaches the disk driver, it creates a Command Descriptor Block (CDB) that will be sent to the SCSI device.  The disk driver embeds the CDB into a structure called the SCSI_REQUEST_BLOCK (SRB).  This SRB is sent to the port driver as part of the I/O request packet (IRP).
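
The chain of translations above can be sketched in a few lines of Python.  This is a simplified illustration, not how the file system or volume manager are actually implemented: it assumes a file stored in one contiguous extent, a volume that is a single partition, and 512-byte sectors.  The function names and the sample numbers are hypothetical; only the READ(10) CDB layout (opcode 0x28, a 4-byte big-endian block address, a 2-byte transfer length) comes from the SCSI command set.

```python
import struct

SECTOR_SIZE = 512  # assumed sector size in bytes


def file_offset_to_volume_offset(file_offset, extent_start):
    """File system layer: byte offset in the file -> byte offset in the volume.

    Assumes the file occupies one contiguous extent starting at extent_start.
    """
    return extent_start + file_offset


def volume_offset_to_lba(volume_offset, partition_start_lba):
    """Volume manager layer: volume byte offset -> block number on the disk."""
    if volume_offset % SECTOR_SIZE:
        raise ValueError("offset must be sector aligned")
    return partition_start_lba + volume_offset // SECTOR_SIZE


def build_read10_cdb(lba, block_count):
    """Disk driver layer: build a 10-byte SCSI READ(10) CDB."""
    # opcode, flags, LBA (4 bytes), group number, transfer length (2 bytes), control
    return struct.pack(">BBIBHB", 0x28, 0, lba, 0, block_count, 0)


# Hypothetical request: read 8 sectors at byte 4096 of a file.
volume_offset = file_offset_to_volume_offset(4096, extent_start=1048576)
lba = volume_offset_to_lba(volume_offset, partition_start_lba=2048)
cdb = build_read10_cdb(lba, block_count=8)
print(lba, cdb.hex())
```

In the real stack the CDB would then be placed in an SRB and handed to the port driver; the sketch stops at the CDB because that is the part with a fixed, well-known layout.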

 

The port driver does most of the request processing.  There are different port drivers depending on the architecture.  For example, ATAPORT.SYS is the port driver for ATA devices, and STORPORT.SYS is the port driver for SCSI devices.  Some of the responsibilities for a port driver are:

  • Providing timing services for requests.
  • Enforcing queue depth (making sure that a device does not get more requests than it can handle).
  • Building scatter gather arrays for data buffers.
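
To make the queue depth responsibility concrete, here is a minimal sketch of the idea.  It is a toy model and not STORPORT's code: the class name `QueueDepthLimiter` and its methods are hypothetical, and requests are represented by plain strings.

```python
from collections import deque


class QueueDepthLimiter:
    """Hold back requests once `depth` of them are outstanding on the device."""

    def __init__(self, depth):
        self.depth = depth
        self.outstanding = set()  # requests currently sent to the device
        self.waiting = deque()    # requests held back until a slot frees up

    def submit(self, request):
        """Return True if the request was sent now, False if it was queued."""
        if len(self.outstanding) < self.depth:
            self.outstanding.add(request)
            return True
        self.waiting.append(request)
        return False

    def complete(self, request):
        """Retire a request and release the oldest waiting one, if any."""
        self.outstanding.remove(request)
        if self.waiting:
            self.outstanding.add(self.waiting.popleft())


limiter = QueueDepthLimiter(depth=2)
results = [limiter.submit(name) for name in ("A", "B", "C")]
limiter.complete("A")  # frees a slot, so "C" is sent to the device
print(results, sorted(limiter.outstanding))
```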

The port driver interfaces with a driver called the “miniport”.  The miniport driver is designed by the hardware manufacturer to work with a specific adapter and is responsible for taking requests from the port driver and sending them to the target LUN.  The port driver calls the HwStorStartIo() function to send requests to the miniport, and the miniport will send the requests to the HBA so they can be sent over the physical medium (fibre, ethernet, etc) to the LUN.  When the request is complete, the miniport will call StorPortNotification() with the NotificationType parameter value set to RequestComplete, along with a pointer to the completed SRB.

 

When a request is sent to the miniport, STORPORT will put the request in a pending queue.  When the request is completed, it is removed from this queue.  While requests are in the pending queue they are timed. 

 

The timing mechanism is simple.  There is one timer per logical unit and it is initialized to -1.  When the first request is sent to the miniport, the timer is set to the timeout value in the SRB.  The disk timeout value is a tunable parameter in the registry at: HKLM\System\CurrentControlSet\Services\Disk\TimeOutValue.  Some vendors will tune this value to best match their hardware.  We do not advise changing this value without guidance from your storage vendor.
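
If you only want to see what the value is currently set to, a read-only check is enough.  The sketch below uses Python's standard `winreg` module and the registry path given above; the helper name `read_disk_timeout` is hypothetical, and it returns None when the value is absent or when it is run on something other than Windows.  It does not modify anything.

```python
import sys


def read_disk_timeout():
    """Return the Disk TimeOutValue in seconds, or None if it cannot be read."""
    if sys.platform != "win32":
        return None  # the registry only exists on Windows
    import winreg
    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE,
                            r"System\CurrentControlSet\Services\Disk") as key:
            value, _value_type = winreg.QueryValueEx(key, "TimeOutValue")
            return int(value)
    except OSError:
        return None  # key or value is missing, or access was denied


timeout = read_disk_timeout()
print("Disk TimeOutValue:", timeout)
```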

 

The timer is decremented once per second.  When a request completes, the timer is refreshed with the timeout value of the head request in the pending queue.  So, as long as requests complete the timer will never go to zero.  If the timer does go to zero, it means the device has stopped responding.  That is when the STORPORT driver logs the Event ID 129 error.  STORPORT then has to take corrective action by trying to reset the unit.  When the unit is reset, all outstanding requests are completed with an error and they are retried.  When the pending queue empties, the timer is set to -1 which is its initial value.
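
The behavior described above can be modeled with a small simulation.  This is only a sketch of the rules as stated in this article, not STORPORT's implementation: the class `LunTimer`, its method names, and the 3-second timeouts are all made up for illustration.

```python
from collections import deque

IDLE = -1  # timer value when no requests are pending


class LunTimer:
    """Toy model of the per-LUN timer and pending queue."""

    def __init__(self):
        self.pending = deque()  # (name, timeout) tuples in the order sent
        self.timer = IDLE
        self.resets = 0         # how many times a reset (event 129) occurred

    def send(self, name, timeout):
        self.pending.append((name, timeout))
        if self.timer == IDLE:  # the first request starts the timer
            self.timer = timeout

    def complete(self, name):
        self.pending = deque(r for r in self.pending if r[0] != name)
        # Refresh from the head of the pending queue, or go idle if it is empty.
        self.timer = self.pending[0][1] if self.pending else IDLE

    def tick(self):
        """Called once per second; returns requests failed by a reset."""
        if self.timer == IDLE:
            return []
        self.timer -= 1
        if self.timer == 0:
            return self.reset()
        return []

    def reset(self):
        self.resets += 1              # this is where event 129 would be logged
        failed = list(self.pending)   # completed with an error, to be retried
        self.pending.clear()
        self.timer = IDLE
        return failed


lun = LunTimer()
lun.send("A", 3)
lun.send("B", 3)
lun.tick()                 # timer counts down from 3 to 2
lun.complete("A")          # timer is refreshed from B's timeout
timer_after_completion = lun.timer

failed = []
for _ in range(3):         # B never completes, so the timer reaches zero
    failed += lun.tick()
print(timer_after_completion, failed, lun.resets, lun.timer)
```

Note that as long as something keeps completing, the timer keeps getting refreshed; the reset only fires when the device goes quiet for a full timeout period.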

Each SRB has a timer value set.  As requests are completed the queue timer is refreshed with the timeout value of the SRB at the head of the list.

 

The most common causes of the Event ID 129 errors are unresponsive LUNs or a dropped request.  Dropped requests can be caused by faulty routers or other hardware problems on the SAN.

 

I have never seen software cause an Event ID 129 error.  If you are seeing Event ID 129 errors in your event logs, then you should start investigating the storage and fibre network.

Comments

  • Anonymous
    August 25, 2012
    We are troubleshooting a "non-responsive" type issue on one of our Exchange clusters currently.  The active node in the cluster is posting Event ID 129 with Event Source ql2300. We have supplied logs to our SAN fabric vendor to which they have uncovered no issues. We have engaged our storage array vendor who has thoroughly reviewed the storage frame to which this environment is zoned from both a hardware and performance perspective.  No issues found.  We have collected logs and have been working the hardware support angle for 2 days.  I have seen a few articles referring to an issue with the Microsoft StorPort driver that failed to register the Iologmsg.dll file properly.  We are investigating that now.   I have also suggested that we open a case with Qlogic and provide them details.  Maybe this is a hardware issue on the HBA??  

  • Anonymous
    September 26, 2012
    the LOL OMG msg .dll haha will have to steal that [Good catch in the prior comment.]

  • Anonymous
    December 31, 2012
    The VHD Miniport Driver (vhdmp.sys, logged with a Source of vhdmp in the Event Log) will throw 129 errors every 30 seconds during a backup if the Backup (Volume Snapshot) Integration Service is enabled for a Guest VM but not supported by the Guest VM OS. I get this all the time when I forget to disable the Backup Integration Service for my FreeBSD VMs.

  • Anonymous
    February 07, 2013
    If you are seeing this in a guest virtual machine, it is worth looking at this social.technet.microsoft.com/.../97186e59-2f4e-4f58-b56b-c88f49487211 Some have reported that Event ID 129 is logged under the following conditions: 1)Hyper-V, 2)Windows Server 2012 in a guest virtual machine, 3)One or more drives connected via a virtual SCSI controller. The simple fix is to use the virtual IDE controller instead - although this limits the virtual machine to using 4 drives (because you can't use the SCSI controller). [Hi Mike.  As I understand your description, you are attaching .vhdx files to a virtual SCSI controller and then the VM generates event 129 errors.  The forum post you point to was for a VSS problem on Server 2008 and does not involve Server 2012.  We have not received reports of the problem you describe.  Unfortunately we are not able to provide in depth 1:1 troubleshooting on this blog, I would encourage you to open a case with Microsoft Support to further investigate the problem you experienced.]

  • Anonymous
    February 20, 2013
    I believe Mike is referring to the problem described here: social.technet.microsoft.com/.../e95631c6-c6b0-4dc8-a003-af4adbf113e9 . This is the common setup so far between all of us that are encountering this issue: 1.)  Guest VM is Server 2012 2.)  One or more VHDX (maybe VHD as well) are connected via a virtual SCSI controller. 3.)  Once event 129 is logged in the problematic VM, other VMs become unresponsive as well. 4.)  The problematic VM can only be turned off (normal OS shut down hangs) 5.)  Detaching the Virtual Disk from the SCSI controller and attaching it to the IDE controller prevents the problem from occurring but then you are facing the 4 VD IDE limit. There are numerous posts about this popping up all over but so far no real resolution besides using IDE instead of SCSI (which isn't a permanent solution). [Thank you for the additional information.  Unfortunately we are not able to provide in depth diagnosis through this blog.  Please open a case with Microsoft Support to investigate this issue.]

  • Anonymous
    June 18, 2013
    Hey guys, this article is awesome, very well laid out, structured and easy to follow.  I wish all were like this.  Can you add one or two items to this article?  Can you add some diagrams / flow charts related to the driver / IO flow described above including storage stack, etc?  I know some are probably within the MS books but would love to see them here for quick read & reference.  Thanks! [Thank you for your feedback.  We don't have any articles showing the flow of I/O through the storage stack, we will consider this for a future article.  In the meantime these two articles somewhat describe how a request gets to storport and what storport does with it: http://blogs.msdn.com/b/ntdebugging/archive/2011/11/23/where-did-my-disk-i-o-go.aspx http://blogs.msdn.com/b/ntdebugging/archive/2012/06/21/what-did-storport-do-with-my-i-o.aspx ]  

  • Anonymous
    December 03, 2014
    We noticed that when this error is logged, a call to WriteFile may return NumberOfBytesWritten less than nNumberOfBytesToWrite, however the return value is not FALSE and GetLastError is also NO_ERROR. I suppose part of the buffer's data was written and the rest of the data was in queue and was discarded due to timeout and "reset to device". Am I right? Or is it a bug in vendor's driver? [Hi Sergii.  This is not a problem that we are familiar with.  Unfortunately we are not able to provide in depth 1:1 troubleshooting through this blog.  You may want to leverage the resources at http://support.microsoft.com for in depth assistance.]

  • Anonymous
    December 06, 2014
    Your last 2 sentences are incorrect. I have seen software be the issue more often than not.  9 times out of 10 the event 129 error can be resolved with updated network drivers and/or disk timeout values. If you are in a vm environment update your esx, your blades and your local vmware tools to get the most up to date nic drivers and you should be good to go. The standard microsoft support response of "Check with your storage vendor" is misleading to those who actually need a fix. [In the example of a VM environment, the virtualization software is the hardware from the perspective of a Windows guest VM.  The statement at the end of the article is referring to software running inside of Windows, not a virtualization stack.]

  • Anonymous
    July 29, 2015
    VMWare actually have a specific KB article relating to some 129 warnings in the event log; kb.vmware.com/.../search.do They suggest it's a problem in the default LSI SAS drivers within Windows, which I think is 1.28.3.52 for Server 2008 R2, that can be fixed by upgrading the version of the driver to 1.32.01 or above, or worked around by using the VMWare paravirtual SCSI drivers, detailed here kb.vmware.com/.../search.do

  • Anonymous
    September 09, 2015
    Same here with event ID 129 as a software (driver) issue.  I have a very nice 2009 vintage Dell XPS-1340 with GeForce 9400M G motherboard and Nvidia graphics.  The problem has been with every OS update, -> Win7, Win8.x, Win10, I get hard disk freezes after the OS "upgrade."  In Win7 and Win8.x, the MCP79 chipset was listed as a "co-processor" in Device Manager and reverting to the Vista vintage drivers for the MCP79 chipset fixed the problem.  For Win10, the MCP79 chipset no longer appears in Device Manager as a co-processor and I am getting Event ID 129 warnings associated with hard disk freezes with a source of nvstor64.sys, the IDE ATA/ATAPI controller driver.  The problem seems to be reverting to the right vintage Nvidia GeForce Storage Management Controller and IDE ATA/ATAPI controller as the latest NVidia drivers in this regard offered for Win10 (of 2009 vintage) do not fix the problem.  I guess the ultimate solution is to chuck Windows 10 and go back to Windows 8.1.  It's been over 6 years with several OS upgrades that don't work well off the shelf in spite of compatibility OK reports and Microsoft and Nvidia do not seem to be able to get their act together on this and Dell prefers to use an OS upgrade to sell you a brand new computer rather than work with Nvidia to update drivers. [An event 129 could be generated due to an issue with storage drivers.  This would be a device timeout caused by the device driver.]