How It Works: Debugging SQL Server Stalled or Stuck I/O Problems - Root Cause
Previously I have covered stuck and stalled I/O issues in other posts and articles. However, these only tell you that there is a problem outside the SQL Server engine. This post attempts to extend your root cause debugging capabilities beyond the SQL Server process.
- https://blogs.msdn.com/psssql/archive/2006/11/27/what-do-i-need-to-know-about-sql-server-database-engine-i-o.aspx
- https://www.microsoft.com/technet/prodtechnol/sql/2005/diagandcorrecterrs.mspx
NOTE: Consult with your Windows Administrator before considering these techniques because this is an advanced technique. If you have further questions about using these techniques and need the assistance of Microsoft you should consult with Windows Support.
SQL Server considers a stuck or stalled I/O one that has not achieved completion status from the operating system level onward. The problem could be an OS issue, 3rd party driver, hardware or other I/O path related problem.
When SQL Server makes the ReadFile, WriteFile or other I/O call the Windows I/O manager (kernel mode) creates the IRP and begins processing the request. The IRP has a stack concept and when exposed shows the owning stack location. To see the IRP stack you have to look into the kernel address space.
Microsoft provides a set of debugging tools that can be used to enumerate the active IRPs.
WARNING |
|
No Reboot Required - Operating System Remains Active (Windows 2003 32 and 64 bit, Windows XP 32 and 64 bit, Vista 32 bit)
This method does not require reboot of the computer. It works off the live server and as such the data could be in-flight and may require more than one capture attempt to determine the proper IRP(s) to troubleshoot.
Install the Debugging Tools For Windows for the native computer architecture - https://www.microsoft.com/whdc/devtools/debugging/default.mspx
Install LiveKD in the same directory you installed the debugging tools. https://technet.microsoft.com/en-us/sysinternals/bb897415.aspx (LiveKD is not currently supported on 64 bit Vista and Windows 2008 unless you disable signed driver compliance.)
At the end of this post is the FILES TO SAVE section. Save the files as directed.
Open a command prompt and change (cd) to the debugger install directory. Then issue the following commands. This will capture the information in the CollectSQLIRPs.log file. Use the evaluation section of this post to evaluate the output.
set _NT_SYMBOL_PATH=srv*c:tempsymbols*https://msdl.microsoft.com/download/symbols
livekd.exe
Reboot Required - Operating System Remains Active
This method does require reboot of the computer. It works off the live server and as such the data could be in-flight and may require more than one capture attempt to determine the proper IRP(s) to troubleshoot. This technique uses the live debugging capabilities of the Windows Debuggers to obtain the information.
Install the Debugging Tools For Windows for the native computer architecture - https://www.microsoft.com/whdc/devtools/debugging/default.mspx
Add /DEBUG to the boot.ini or use bcdedit.exe, Debug On to enable kernel debug capabilities.
Restart the computer
At the end of this post is the FILES TO SAVE section. Save the files as directed.
Open a command prompt and change (cd) to the debugger install directory. Then issue the following commands.
set _NT_SYMBOL_PATH=srv*c:tempsymbols*https://msdl.microsoft.com/download/symbols
windbg.exe -klIn the command window of the debugger issue the following command. This will capture the information in the CollectSQLIRPs.log file. Use the evaluation section of this post to evaluate the output.
$<collectsqlirps.script
Reboot Required - Operating System Frozen
This method does require reboot of the computer and a second debugging client computer.
Install the Debugging Tools For Windows for the native computer architecture - https://www.microsoft.com/whdc/devtools/debugging/default.mspx
Follow documented instructions to enabled kernel debugging from a client machine. You commonly need a debug cable and other settings to establish the debugging session.
At the end of this post is the FILES TO SAVE section. Save the files as directed.
Open a command prompt and change (cd) to the debugger install directory. Then issue the following commands.
set _NT_SYMBOL_PATH=srv*c:tempsymbols*https://msdl.microsoft.com/download/symbols
windbg.exeEstablish the kernel debugging session. File | Kernel Debug
In the command window of the debugger issue the following command. This will capture the information in the CollectSQLIRPs.log file. Use the evaluation section of this post to evaluate the output.
$<collectsqlirps.script
Evaluation
The output from the debugging script is placed in the CollectSQLIRPs.log file. The script will take a snapshot of the IRPs, wait 10 seconds and take another snapshot. Using the snapshots you can find the IRPs that appear to be stuck or stalled and further investigate.
Not all IRPs are expected to change state.
The following shows the an IRP waiting for a network connection and it is expected to wait. You have to look at each IRPs type it understand the expected behavior. SQL Server File I/O based IRPs should complete quickly.
THREAD ADDRESS 0xfffffadf9d396840
.echo 0xfffffadf9c9a1200
.echo 0xfffffadfa7896010
IRP ADDRESS 0xfffffadf9c9a1200
+0x000 Type : 5
+0x002 Size : 184
+0x008 DeviceObject : 0xfffffadf`9d6ecbc0 _DEVICE_OBJECT
+...+0x058 FileName : _UNICODE_STRING "SQLLocalSQL2005"
...+0x038 DriverName : _UNICODE_STRING "FileSystemNpfs"
Irp is active with 1 stacks 1 is current (= 0xfffffadf9c9a12d0)
No Mdl: No System Buffer: Thread fffffadf9d396840: Irp stack trace.
Flags = 00000800
ThreadListEntry.Flink = fffffadfa7896030
ThreadListEntry.Blink = fffffadf9d396bd8
IoStatus.Status = 00000000
IoStatus.Information = 00000000
RequestorMode = 00000001
Cancel = 00
CancelIrql = 0
ApcEnvironment = 00
UserIosb = 00edfab0
UserEvent = 00000000
Overlay.AsynchronousParameters.UserApcRoutine = 00000000
Overlay.AsynchronousParameters.UserApcContext = 00edfab0
Overlay.AllocationSize = 00000000 - 00000000
CancelRoutine = fffffadf90ce3320 Npfs!NpCancelChangeNotifyIrp
UserBuffer = 00000000
&Tail.Overlay.DeviceQueueEntry = fffffadf9c9a1278
Tail.Overlay.Thread = fffffadf9d396840
Tail.Overlay.AuxiliaryBuffer = 00000000
Tail.Overlay.ListEntry.Flink = fffffa80047460b8
Tail.Overlay.ListEntry.Blink = fffffa80047460b8
Tail.Overlay.CurrentStackLocation = fffffadf9c9a12d0
Tail.Overlay.OriginalFileObject = fffffadf9dad85e0
Tail.Apc = 00000000
Tail.CompletionKey = 00000000
cmd flg cl Device File Completion-Context
>[ d, 0] 5 1 fffffadf9d6ecbc0 fffffadf9dad85e0 00000000-00000000 pending
FileSystemNpfs
Args: 00000000 00000000 00110008 00000000
The following is an example of a stuck IRP.
This IRP stayed in the same state for over an hour. Looking closer at the stack I could tell it was waiting for the client to send a TDS packet. This should be waiting at the network driver (TPC for example) level and got stuck on its way in MYDRIVER.
Note: This is a stuck network IRP but the technique applies to file I/O as well.
!irp fffffadf9d599c60
Irp is active with 4 stacks 2 is current (= 0xfffffadf9d599d78)
Mdl=fffffadf9d83c310: No System Buffer: Thread fffffadf9c839770: Irp stack trace.
cmd flg cl Device File Completion-Context
[ 0, 0] 0 0 00000000 00000000 00000000-00000000
Args: 00000000 00000000 00000000 00000000
>[ f, 8] 0 e1 fffffadf9dd86af0 fffffadf9c9cda20 fffffadf8d468e20-00000000 Success Error Cancel pending
DriverTcpip mydriver
Args: 1002000001f69 00000000 00000000 00000000
Using the !thread command and public symbols I can see stack.
!thread fffffadf9c839770
THREAD fffffadf9c839770 Cid 0600.1360 Teb: 000007fffff80000 Win32Thread: 0000000000000000 WAIT: (UserRequest)
...Child-SP RetAddr Call Site
fffffadf`688c0970 fffff800`01027682 nt!KiSwapContext+0x85 [d:ntbasentoskeamd64ctxswap.asm @ 80]
fffffadf`688c0af0 fffff800`0102828e nt!KiSwapThread+0x3c9 [d:ntbasentoskethredsup.c @ 2018]
fffffadf`688c0b50 fffff800`011129fe nt!KeWaitForSingleObject+0x5a6 [d:ntbasentoskewait.c @ 1256]
fffffadf`688c0bd0 fffff800`0102e33d nt!NtSignalAndWaitForSingleObject+0x18e [d:ntbasentosobobwait.c @ 278]
fffffadf`688c0c70 00000000`77ef1b5a nt!KiSystemServiceCopyEnd+0x3 (TrapFrame @ fffffadf`688c0c70)
00000000`0af6f028 00000000`77d44ebb ntdll!ZwSignalAndWaitForSingleObject+0xa
00000000`0af6f030 00000000`01003ff0 kernel32!SignalObjectAndWait+0x133 [d:ntbasewin32clientsynch.c @ 1309]
00000000`0af6f0e0 00000000`01003693 sqlservr!SOS_Scheduler::SwitchContext+0x378
00000000`0af6f390 00000000`0100353e sqlservr!SOS_Scheduler::Suspend+0xaf
00000000`0af6f400 00000000`0135957f sqlservr!SOS_Event::Wait+0x151
00000000`0af6f460 00000000`01359c19 sqlservr!CNetConnection::FWaitForNewPacket+0x7e
00000000`0af6f570 00000000`0103e7f3 sqlservr!CNetConnection::FReadTdsPacket+0x1f5
00000000`0af6f6a0 00000000`01035a5b sqlservr!CNetByteStream::ReadTDSPacketFromNetwork+0xd4
00000000`0af6f740 00000000`0103584a sqlservr!CNetByteStream::EsetBeginNewStream+0x8b
00000000`0af6f840 00000000`0100da8e sqlservr!process_commands+0x14a
00000000`0af6faf0 00000000`0100dc11 sqlservr!SOS_Task::Param::Execute+0xee
00000000`0af6fc00 00000000`0100e971 sqlservr!SOS_Scheduler::RunTask+0xc9
At the end of the capture the lmkv was issued to list the details about the loaded kernel modules. Using this information I can find out more details about the driver such as file location.
Loaded symbol image file: MYDRIVER.SYS
Image path: ??C:WINDOWSsystem32DriversMYDRIVER.SYS
Image name: MYDRIVER.SYS
Timestamp: Wed Jul 20 15:20:21 2005 (42DEB205)
CheckSum: 000045C7
ImageSize: 00008000
Translations: 0000.04b0 0000.04e0 0409.04b0 0409.04e0
------ FILES TO SAVE ------
NTSD.INI
Save the following as file NTSD.INI in the debugger install directory.
$<collectsqlirps.script
collectsqlirps.script
Save the following as file collectsqlirps.script in the debugger install directory. Note, the debugger does not let command span more than one line. The .foreach commands in this script have to remain on a single line without CRLR.
$$===========================================================================
$$=== The following dumps all Thread IRPs for the SQLServr.exe processes
$$=== !process 0 0 sqlservr.exe
$$===========================================================================
.logopen CollectSQLIRPs.log$$ Turn off a bunch of output that would cause .foreach to receive unwanted tokens
!sym quiet
!sym prompts off.symopt-0x01
.symopt-0x04
.symopt-0x02
.symopt-0x10
.symopt-0x20
.symopt-0x10000
.symopt-0x20000.echo Gathering offsets
r $t1 = @@C++(#FIELD_OFFSET(nt!_ETHREAD, ThreadListEntry))
r $t2 = @@C++(#FIELD_OFFSET(nt!_ETHREAD, IrpList))
r $t3 = @@C++(#FIELD_OFFSET(nt!_IRP, ThreadListEntry))
r $t4 = @@C++(#FIELD_OFFSET(nt!_EPROCESS, ThreadListHead))
r $t5 = @@C++(#FIELD_OFFSET(nt!_LIST_ENTRY, Blink))
r $t6 = @@C++(#FIELD_OFFSET(nt!_IRP, Tail.Overlay.OriginalFileObject))
r $t7 = @@C++(#FIELD_OFFSET(nt!_FILE_OBJECT, DeviceObject))
r $t8 = @@C++(#FIELD_OFFSET(nt!_DEVICE_OBJECT, DriverObject)).echo =========================================================
.echo Thread IRPS for SQL Server
.echo =========================================================
.time
.foreach /pS 1 /ps 11 (process {!process 0 0 sqlservr.exe}) { .echo EPROCESS ADDRESS ${process}; dt ${process} nt!_EPROCESS ImageFileName; .foreach (threadaddress { !list -t nt!_ETHREAD.ThreadListEntry.Flink -x ".echo" poi(${process}+$t4)-$t1 } ) { .echo THREAD ADDRESS ${threadaddress}; .foreach(irpaddress { !list -t nt!_IRP.ThreadListEntry.Flink -x ".echo" poi(${threadaddress}+$t2)-$t3 } ) {.echo IRP ADDRESS ${irpaddress}; j( poi(${irpaddress}+$t3) = poi(poi(${irpaddress}+$t3)+$t5)) '.echo No active IRP on Thread' ; 'dt poi(${irpaddress}+$t6) nt!_FILE_OBJECT; dt poi(poi(${irpaddress}+$t6)+$t7) nt!_DEVICE_OBJECT; dt poi(poi(poi(${irpaddress}+$t6)+$t7)+$t8) nt!_DRIVER_OBJECT; !irp ${irpaddress} 1' } } }.echo =========================================================
.echo Sleeping for 10 seconds
.echo =========================================================
.sleep 10000.echo =========================================================
.echo Thread IRPS for SQL Server
.echo =========================================================
.time
.foreach /pS 1 /ps 11 (process {!process 0 0 sqlservr.exe}) { .echo EPROCESS ADDRESS ${process}; dt ${process} nt!_EPROCESS ImageFileName; .foreach (threadaddress { !list -t nt!_ETHREAD.ThreadListEntry.Flink -x ".echo" poi(${process}+$t4)-$t1 } ) { .echo THREAD ADDRESS ${threadaddress}; .foreach(irpaddress { !list -t nt!_IRP.ThreadListEntry.Flink -x ".echo" poi(${threadaddress}+$t2)-$t3 } ) {.echo IRP ADDRESS ${irpaddress}; j( poi(${irpaddress}+$t3) = poi(poi(${irpaddress}+$t3)+$t5)) '.echo No active IRP on Thread' ; 'dt poi(${irpaddress}+$t6) nt!_FILE_OBJECT; dt poi(poi(${irpaddress}+$t6)+$t7) nt!_DEVICE_OBJECT; dt poi(poi(poi(${irpaddress}+$t6)+$t7)+$t8) nt!_DRIVER_OBJECT; !irp ${irpaddress} 1' } } }.echo =========================================================
.echo Kernel Module List
.echo =========================================================
lmkv.echo =========================================================
.echo Process list
.echo =========================================================
!process 0 0$$ Close the log file
.logclose
q
Bob Dorr
SQL Server Senior Escalation Engineer
Comments
Anonymous
March 03, 2008
You've been kicked (a good thing) - Trackback from DotNetKicks.comAnonymous
January 17, 2011
Having non-yielding scheduler issue on a Windows 2003 Version 3790 (SP2)MP(16 procs) Free x64 serve with SQL 2005 SP3 running on it. Every day we see IO warning msgs after 6 am and cluster becomes un-responsive by 8:38 am, cluster fail over to other node. Until 8:37 system is accessible and IO response on the drive is very good. I have analysed sql and cluster hang dump, but still unable find the root cause. Could you please suggest, let me know if you need the dump or windbg output.