

Determining the source of Bug Check 0x133 (DPC_WATCHDOG_VIOLATION) errors on Windows Server 2012

What is a bug check 0x133?

Starting in Windows Server 2012, a DPC watchdog timer is enabled that will bug check the system if too much time is spent in DPC routines. This bug check was added to help identify drivers that are deadlocked or misbehaving.  The bug check is of type "DPC_WATCHDOG_VIOLATION" and has a code of 0x133.  (Windows 7 also included a DPC watchdog but, by default, it only took action when a kernel debugger was attached to the system.)  A description of DPC routines can be found at https://msdn.microsoft.com/en-us/library/windows/hardware/ff544084(v=vs.85).aspx.

 

The DPC_WATCHDOG_VIOLATION bug check can be triggered in two ways. First, if a single DPC exceeds a specified number of ticks, the system will stop with 0x133 with parameter 1 of the bug check set to 0.  In this case, the system's time limit for a single DPC will be in parameter 3, with the number of ticks taken by this DPC in parameter 2.  Alternatively, if the system exceeds a larger timeout of time spent cumulatively in all DPCs since the IRQL was raised to DPC level, the system will stop with a 0x133 with parameter 1 set to 1.  Microsoft recommends that DPCs should not run longer than 100 microseconds and ISRs should not run longer than 25 microseconds; however, the actual timeout values on the system are set much higher.

 

How to debug a 0x133 (0, …

In the case of a stop 0x133 with the first parameter set to 0, the call stack should contain the offending driver.  For example, here is a debug of a 0x133 (0,…) kernel dump:

 

0: kd> .bugcheck

Bugcheck code 00000133

Arguments 00000000`00000000 00000000`00000283 00000000`00000282 00000000`00000000  

 

Per MSDN, we know that this DPC has run for 0x283 ticks, when the limit was 0x282.
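If you want to check the arithmetic outside the debugger, the parameter meanings described above can be sketched in a few lines of Python. This is only an illustration: the function names decode_dpc_watchdog and parse_arguments_line are made up for this example, and only the parameter meanings stated in this article are decoded.

```python
def parse_arguments_line(line):
    """Parse a '.bugcheck' Arguments line into a list of integers."""
    fields = line.split()[1:]  # drop the leading 'Arguments' label
    # The debugger separates the high and low 32 bits with a backtick.
    return [int(field.replace("`", ""), 16) for field in fields]


def decode_dpc_watchdog(args):
    """Describe the arguments of a 0x133 bug check (hypothetical helper).

    Only the cases described in the article are handled; anything else is
    reported as unknown rather than guessed.
    """
    p1, p2, p3, _p4 = args
    if p1 == 0:
        return {
            "kind": "single DPC exceeded its time limit",
            "ticks_taken": p2,   # parameter 2
            "tick_limit": p3,    # parameter 3
        }
    if p1 == 1:
        return {"kind": "cumulative time at DPC level exceeded its limit"}
    return {"kind": "unknown"}


example = ("Arguments 00000000`00000000 00000000`00000283 "
           "00000000`00000282 00000000`00000000")
info = decode_dpc_watchdog(parse_arguments_line(example))
print(info)
```

For the dump above this reports 643 ticks taken against a limit of 642, which is simply 0x283 and 0x282 in decimal.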

 

0: kd> k

Child-SP          RetAddr           Call Site

fffff803`08c18428 fffff803`098525df nt!KeBugCheckEx

fffff803`08c18430 fffff803`09723f11 nt! ??::FNODOBFM::`string'+0x13ba4

fffff803`08c184b0 fffff803`09724d98 nt!KeUpdateRunTime+0x51

fffff803`08c184e0 fffff803`09634eba nt!KeUpdateTime+0x3f9

fffff803`08c186d0 fffff803`096f24ae hal!HalpTimerClockInterrupt+0x86

fffff803`08c18700 fffff803`0963dba2 nt!KiInterruptDispatchLBControl+0x1ce

fffff803`08c18898 fffff803`096300d0 hal!HalpTscQueryCounter+0x2

fffff803`08c188a0 fffff880`04be3409 hal!HalpTimerStallExecutionProcessor+0x131

fffff803`08c18930 fffff880`011202ee ECHO!EchoEvtTimerFunc+0x7d                //Here is our driver, and we can see it calls into StallExecutionProcessor

fffff803`08c18960 fffff803`097258b4 Wdf01000!FxTimer::TimerHandler+0x92

fffff803`08c189a0 fffff803`09725ed5 nt!KiProcessExpiredTimerList+0x214

fffff803`08c18ae0 fffff803`09725d88 nt!KiExpireTimerTable+0xa9

fffff803`08c18b80 fffff803`0971fe76 nt!KiTimerExpiration+0xc8

fffff803`08c18c30 fffff803`0972457a nt!KiRetireDpcList+0x1f6

fffff803`08c18da0 00000000`00000000 nt!KiIdleLoop+0x5a

 

Let’s unassemble the driver’s DPC routine and see what it is doing:

 

0: kd> ub fffff880`04be3409

ECHO!EchoEvtTimerFunc+0x54:

fffff880`04be33e0 448b4320        mov     r8d,dword ptr [rbx+20h]

fffff880`04be33e4 488b0d6d2a0000  mov     rcx,qword ptr [ECHO!WdfDriverGlobals (fffff880`04be5e58)]

fffff880`04be33eb 4883631800      and     qword ptr [rbx+18h],0

fffff880`04be33f0 488bd7          mov     rdx,rdi

fffff880`04be33f3 ff150f260000    call    qword ptr [ECHO!WdfFunctions+0x838 (fffff880`04be5a08)]

fffff880`04be33f9 bbc0d40100      mov     ebx,1D4C0h

fffff880`04be33fe b964000000      mov     ecx,64h

fffff880`04be3403 ff15f70b0000    call    qword ptr [ECHO!_imp_KeStallExecutionProcessor (fffff880`04be4000)]   //It's calling KeStallExecutionProcessor with 0x64 (decimal 100) as a parameter

0: kd> u fffff880`04be3409

ECHO!EchoEvtTimerFunc+0x7d:

fffff880`04be3409 4883eb01        sub     rbx,1

fffff880`04be340d 75ef            jne     ECHO!EchoEvtTimerFunc+0x72 (fffff880`04be33fe)     //Here we can see it is jumping back to call KeStallExecutionProcessor in a loop

fffff880`04be340f 488b5c2430      mov     rbx,qword ptr [rsp+30h]

fffff880`04be3414 4883c420        add     rsp,20h

fffff880`04be3418 5f              pop     rdi

fffff880`04be3419 c3              ret

fffff880`04be341a cc              int     3

fffff880`04be341b cc              int     3

 

0: kd> !pcr

KPCR for Processor 0 at fffff80309974000:

    Major 1 Minor 1

      NtTib.ExceptionList: fffff80308c11000

          NtTib.StackBase: fffff80308c12080

         NtTib.StackLimit: 000000d70c7bf988

       NtTib.SubSystemTib: fffff80309974000

            NtTib.Version: 0000000009974180

        NtTib.UserPointer: fffff803099747f0

            NtTib.SelfTib: 000007f7ab80c000

 

                  SelfPcr: 0000000000000000

                     Prcb: fffff80309974180

                     Irql: 0000000000000000

                      IRR: 0000000000000000

                      IDR: 0000000000000000

            InterruptMode: 0000000000000000

                      IDT: 0000000000000000

                      GDT: 0000000000000000

                      TSS: 0000000000000000

 

            CurrentThread: fffff803099ce880

               NextThread: fffffa800261cb00

               IdleThread: fffff803099ce880

 

                DpcQueue:  0xfffffa80020ce790 0xfffff880012e4e9c [Normal] NDIS!NdisReturnNetBufferLists

                           0xfffffa800185f118 0xfffff88000c0ca00 [Normal] ataport!AtaPortInitialize

                           0xfffff8030994fda0 0xfffff8030972bc30 [Normal] nt!KiBalanceSetManagerDeferredRoutine

                           0xfffffa8001dbc118 0xfffff88000c0ca00 [Normal] ataport!AtaPortInitialize

                           0xfffffa8002082300 0xfffff88001701df0 [Normal] USBPORT

 

The !pcr output shows us queued DPCs for this processor. If you want to see more information about DPCs and the DPC Watchdog, you could dump the PRCB listed in the !pcr output like this:

 

dt nt!_KPRCB fffff80309974180 Dpc*

 

Often the driver will be calling into a function like KeStallExecutionProcessor in a loop, as in our example debug.  To resolve this problem, contact the driver vendor to request an updated driver version that spends less time in its DPC routine.
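To get a feel for how long the loop in our example would stall, you can multiply out the two constants from the disassembly: the loop counter loaded by mov ebx,1D4C0h and the stall time loaded by mov ecx,64h (KeStallExecutionProcessor takes its argument in microseconds). The Python below is a rough sketch of that arithmetic; it assumes the loop runs to completion and ignores call overhead and interrupts, so treat the result as a lower bound rather than a measurement.

```python
ITERATIONS = 0x1D4C0           # loop counter loaded into ebx before the loop
STALL_MICROSECONDS = 0x64      # argument passed to KeStallExecutionProcessor
RECOMMENDED_DPC_LIMIT_US = 100 # DPC recommendation quoted earlier in the article

# Each pass through the loop stalls for STALL_MICROSECONDS.
total_us = ITERATIONS * STALL_MICROSECONDS
total_seconds = total_us / 1_000_000
times_over = total_us / RECOMMENDED_DPC_LIMIT_US

print(f"{ITERATIONS} iterations x {STALL_MICROSECONDS} us = {total_seconds} seconds")
print(f"That is {times_over:.0f} times the recommended DPC duration")
```

In other words, the routine asks for roughly 12 seconds of busy-waiting inside a single DPC, so it is no surprise that the watchdog fired before the loop finished.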

 

How to troubleshoot a 0x133 (1, …

Determining the cause of a stop 0x133 with a first parameter of 1 is a bit more difficult because the problem is a result of DPCs running from multiple drivers, so the call stack is insufficient to determine the culprit.  To troubleshoot this stop, first make sure that the NT Kernel Logger or Circular Kernel Context Logger ETW traces are enabled on the system.  (For directions on setting this up, see https://blogs.msdn.com/b/ntdebugging/archive/2009/12/11/test.aspx.)

 

Once the logging is enabled and the system bug checks, dump out the list of ETW loggers using !wmitrace.strdump. Find the ID of the NT Kernel logger or the Circular logger.  You can then use !wmitrace.logsave (ID) (path to ETL) to write out the ETL log to a file.  Load it up with Windows Performance Analyzer and add the DPC or DPC/ISR Duration by Module, Function view (located in the Computation group) to your current analysis window.

 

 

Next, make sure the table is also shown by clicking the box in the upper right of the view.

 

 

Ensure that the Address column is added on the left of the gold bar, then expand each address entry to see individual DPC enters/exits for each function.  Using this data, you can determine which DPC routines took the longest by looking at the inclusive duration column, which should be added to the right of the gold bar.

 

In this case, these DPCs took 1 second, which is well over the recommended maximum of 100 microseconds.  The module column (and possibly the function column, if you have symbols) will show which driver is responsible for that DPC routine.  Since our ECHO driver was based on WDF, that is the module named here.
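If you copy the durations out of the table, the same comparison can be done mechanically. The sketch below assumes you have collected (module, function, inclusive duration in microseconds) rows by hand; the rows shown are hypothetical placeholders, not data from the trace in this article, and long_dpcs is a helper name invented for this example.

```python
RECOMMENDED_DPC_LIMIT_US = 100  # DPC recommendation quoted in the article

# Hypothetical rows: (module, function, inclusive duration in microseconds).
rows = [
    ("Wdf01000.sys", "FxTimer::TimerHandler", 1_000_000),
    ("ndis.sys", "example_dpc", 40),
]


def long_dpcs(rows, limit_us=RECOMMENDED_DPC_LIMIT_US):
    """Return rows over the limit, longest first, with the overshoot factor."""
    over = [(m, f, d, d / limit_us) for m, f, d in rows if d > limit_us]
    return sorted(over, key=lambda row: row[2], reverse=True)


flagged = long_dpcs(rows)
for module, function, duration_us, factor in flagged:
    print(f"{module}!{function}: {duration_us} us "
          f"({factor:.0f}x the recommendation)")
```

A DPC that runs for 1 second comes out at 10,000 times the 100 microsecond recommendation, which makes it easy to pick out from the short-running DPCs around it.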

 

For an example of doing this type of analysis in xperf, see https://blogs.msdn.com/b/ntdebugging/archive/2008/04/03/windows-performance-toolkit-xperf.aspx.

 

More Information

For additional information about Stop 0x133 errors, see this page on MSDN: https://msdn.microsoft.com/en-us/library/windows/hardware/jj154556(v=vs.85).aspx.

 

For DPC timing recommendations and for advice on capturing DPC timing information using tracelog, see https://msdn.microsoft.com/en-us/library/windows/hardware/ff545764(v=vs.85).aspx.

 

Guidelines for writing DPC routines can be found at https://msdn.microsoft.com/en-us/library/windows/hardware/ff546551(v=vs.85).aspx.

 

 

-Matt Burrough