Debugging a bluescreen at home
Hi, my name is Chad. I work as an escalation engineer for Microsoft’s OEM support team.
A while back, I encountered an interesting crash on one of my computers at home, and I thought I’d post about how I debugged it.
This particular machine had been humming along quite happily for some time, but one day while I was scanning some photos, it bluescreened. Naturally, I hoped it was just a fluke, but after it happened a few more times while doing the same thing, I decided to debug it.
Ordinarily, if a machine crashes when performing a specific activity, like scanning photos, my first inclination would be to suspect a bug in one of the drivers involved in that activity, like the scanner driver or the USB driver. But in this case, I had been using this scanner for a long time, with the same drivers, and never had this problem, so this sudden crashing was kind of mysterious.
Let's see what we can tell from the dump!
The first order of business when looking at a crash dump is the !analyze -v command. I've trimmed some of it here for brevity, but it goes something like this:
kd> !analyze -v
*******************************************************************************
* *
* Bugcheck Analysis *
* *
*******************************************************************************
KERNEL_MODE_EXCEPTION_NOT_HANDLED (8e)
This is a very common bugcheck. Usually the exception address pinpoints
the driver/function that caused the problem. Always note this address
as well as the link date of the driver/image that contains this address.
Some common problems are exception code 0x80000003. This means a hard
coded breakpoint or assertion was hit, but this system was booted
/NODEBUG. This is not supposed to happen as developers should never have
hardcoded breakpoints in retail code, but ...
If this happens, make sure a debugger gets connected, and the
system is booted /DEBUG. This will let us see why this breakpoint is
happening.
Arguments:
Arg1: c0000005, The exception code that was not handled
Arg2: 8738e300, The address that the exception occurred at
Arg3: b9b3dc7c, Trap Frame
Arg4: 00000000
Debugging Details:
------------------
EXCEPTION_CODE: (NTSTATUS) 0xc0000005 - The instruction at "0x%08lx" referenced memory at "0x%08lx". The memory could not be "%s".
FAULTING_IP:
+ffffffff8738e300
8738e300 0000 add [eax],al
TRAP_FRAME: b9b3dc7c -- (.trap ffffffffb9b3dc7c)
ErrCode = 00000002
eax=00000001 ebx=bc514c68 ecx=0001065e edx=bc510000 esi=00000955 edi=b9b3dd64
eip=8738e300 esp=b9b3dcf0 ebp=b9b3dd08 iopl=0 nv up ei pl zr na po nc
cs=0008 ss=0010 ds=0023 es=0023 fs=0030 gs=0000 efl=00010246
8738e300 0000 add [eax],al ds:0023:00000001=??
Resetting default scope
DEFAULT_BUCKET_ID: DRIVER_FAULT
BUGCHECK_STR: 0x8E
LAST_CONTROL_TRANSFER: from 8051d6a7 to 8053331e
STACK_TEXT:
WARNING: Frame IP not in any known module. Following frames may be wrong.
b9b3dcec bf801619 b9b3dd64 0012eee4 bf81551f 0x8738e300
b9b3dd08 bf81553c b9b3dd64 0012eee4 bf81551f win32k!ValidateHwnd+0x5c
b9b3dd50 804de7ec 0001065e 012eb244 00000000 win32k!NtUserInvalidateRect+0x1d
b9b3dd50 7c90eb94 0001065e 012eb244 00000000 nt!KiFastCallEntry+0xf8
0012eecc 77d4b601 0105a943 0001065e 012eb244 ntdll!KiFastSystemCallRet
0012eed0 0105a943 0001065e 012eb244 00000000 USER32!NtUserInvalidateRect+0xc
77d4b601 90909090 001134b8 0300ba00 12ff7ffe NikonScan4!GetSource+0x21e93
...
From the stack trace, we can tell that NikonScan4.dll made a call into User32, which ultimately ended up calling into win32k.sys down in kernel mode. Win32k was in a function called ValidateHwnd() and then tried to call some function at address 0x8738e300, at which point we tried to dereference an invalid pointer and crashed.
What’s at 0x8738e300?
kd> dc 8738e300
8738e300 00000000 00000001 00000000 87360350 ............P.6.
8738e310 00000000 00000001 f71af9fe f71b0030 ............0...
8738e320 f71afb0e f71afbb4 f71b0098 f71b0214 ................
8738e330 f71afef6 f71aff8e 07fef800 00000000 ................
8738e340 f71afffc 00000000 0a0e000a 644c6d4d ............MmLd
8738e350 8732ea58 870303e0 ffffffff ffffffff X.2.............
8738e360 00000012 00000000 f797f000 f7989905 ................
8738e370 0000c000 00500050 e1971458 00160016 ....P.P.X.......
kd> !pool 8738e300 2
Pool page 8738e300 region is Nonpaged pool
*8738e2f8 size: 50 previous size: 8 (Allocated) *NV
Owning component : Unknown (update pooltag.txt)
Well, that’s bad. 0x8738e300 isn’t actually a valid address of a function. That location contains some user data (specifically, some nonpaged pool).
So, that's why we blew up: the ValidateHwnd() function in win32k made a call to this bad address which contains data instead of code! Let's see if we can figure out why it did this. We can find the return address in win32k!ValidateHwnd on the stack, and unassemble the instructions leading up to the point where we ran off into the weeds.
kd> kv L8
ChildEBP RetAddr Args to Child
WARNING: Frame IP not in any known module. Following frames may be wrong.
b9b3dcec bf801619 b9b3dd64 0012eee4 bf81551f 0x8738e300
b9b3dd08 bf81553c b9b3dd64 0012eee4 bf81551f win32k!ValidateHwnd+0x5c (FPO: [Non-Fpo])
b9b3dd50 804de7ec 0001065e 012eb244 00000000 win32k!NtUserInvalidateRect+0x1d (FPO: [Non-Fpo])
b9b3dd50 7c90eb94 0001065e 012eb244 00000000 nt!KiFastCallEntry+0xf8 (FPO: [0,0] TrapFrame @ b9b3dd64)
0012eecc 77d4b601 0105a943 0001065e 012eb244 ntdll!KiFastSystemCallRet (FPO: [0,0,0])
0012eed0 0105a943 0001065e 012eb244 00000000 USER32!NtUserInvalidateRect+0xc
77d4b601 90909090 001134b8 0300ba00 12ff7ffe NikonScan4!GetSource+0x21e93
77d4b605 001134b8 0300ba00 12ff7ffe 900008c2 0x90909090
The return address from where we made the bad call is bf801619. Let’s unassemble a few instructions backwards from that address (using the handy but often-overlooked “ub” command) and see what the code was doing:
kd> ub bf801619
win32k!ValidateHwnd+0x2f:
bf8015ff 8d1c82 lea ebx,[edx+eax*4]
bf801602 8bc1 mov eax,ecx
bf801604 c1e810 shr eax,0x10
bf801607 663b430a cmp ax,[ebx+0xa]
bf80160b 75ad jnz win32k!ValidateHwnd+0x3d (bf8015ba)
bf80160d 807b0801 cmp byte ptr [ebx+0x8],0x1
bf801611 7573 jnz win32k!ValidateHwnd+0xff (bf801686)
bf801613 ff15e0b298bf call dword ptr [win32k!_imp__PsGetCurrentThread (bf98b2e0)]
OK, so it's pretty simple. Win32k decided which function address to call by reading it from a pointer stored at a hardcoded location (in other words, in a global variable) within Win32k itself. That pointer is located at bf98b2e0. The debugger helpfully tells us that this pointer is intended to contain the address of a function called PsGetCurrentThread, but let’s double-check this and make sure it actually does. At this point, the working assumption would be that this pointer had gotten corrupted somehow.
kd> dd bf98b2e0 L1
bf98b2e0 804e4a15
Interesting. bf98b2e0 contains the value 804e4a15. This is not even close to the bad address the processor actually called! Remember, from the stack trace, we jumped to 8738e300 instead. What does this pointer actually point to?
kd> ln 804e4a15
(804e4a15) nt!PsGetCurrentThread | (804e4a51) nt!KeSetEventBoostPriority
Exact matches:
nt!PsGetCurrentThread = <no type information>
Sure enough, 804e4a15 is the address of nt!PsGetCurrentThread.
So, basically, win32k was trying to call PsGetCurrentThread(), and even had the correct pointer in memory to get there, but the processor instead jumped to a bogus address located in the middle of some user data.
At this point it's pretty safe to say that this is a hardware problem, and furthermore, since the relevant memory looks fine, it looks like a problem within the CPU itself. (I’m not an expert on CPU architecture, but if I were to take a wild guess I’d say that maybe the processor had some problem when reading from its L2 cache.)
Upon discovering this, I decided to crack open my case and take a look. It didn’t take long to spot the problem:
Yes, that's my CPU fan. Notice that the heatsink is completely clogged up with dust!
The resolution to this story: I took a can of compressed air and blew all the dust out of the heatsink. This took care of the problem, and the computer happily scanned hundreds of photos and hasn’t crashed again since. Why was it only crashing when I was scanning photos? Most likely because this was causing the CPU to run at 100% utilization for extended periods of time, and it was simply overheating!
Comments
Anonymous
March 28, 2008
Chad, Thanks for the great post! I enjoy reading walkthroughs of a bugcheck analysis like this - has me trying to think "what's the next step" each step of the way...Anonymous
March 28, 2008
Nice post. Thanks for sharing your analysis...Anonymous
March 28, 2008
Good write up Chad, had a similar problem a few years back with constant bluescreens as soon as the cpu started doing some heavy work... turns out forgot to add the heatsink thermal gel to spread the heat from the cpu... Funny how we always think it's always a software problem!Anonymous
March 28, 2008
I just wanted to say THANK YOU for writing articles like this. I often wonder how to analyze crash output to better my understanding of Windows OS troubleshooting. -SteveAnonymous
March 28, 2008
Oh also if possible could you include a link to your dump files so we can download, import and !analyze like you do? Using your post and comments as a guide to the thought process. Thanks again! [That would be nice, but remember that memory dumps contain the entire contents of memory at the time they are created. As such, they always have potential to contain sensitive personal information. I wouldn’t ever feel comfortable posting a memory dump from my own personal machine (or any machine in a real production environment) on the web for the general public to see. Sorry!]Anonymous
March 28, 2008
In the crash dump, there's a reference to nikonscan: NikonScan4!GetSource+0x21e93 Where does the GetSource symbol come from? I have never seen a third-party driver in the wild with symbols included.Anonymous
March 28, 2008
The comment has been removedAnonymous
March 29, 2008
Hi, i'm learning a lot with your blog. I have a question, what does the 0x90909090 that is at the bottom of the stack trace mean? I suppose it's an address that the debugger couldn't match with any loaded symbol, but why? Thanks, [The debugger doesn’t have any symbols for Nikonscan4.dll. When it doesn’t have symbols for a module, it will try to keep following the stack anyway, but it’s not always capable of doing so. In this case, it got confused (possibly because this function was compiled using Frame-Pointer Omission) and picked the wrong EBP for the next stack frame, leading to bad results.
In a case like this, if you can’t get hold of the symbols for the module, you can dump out the contents of the stack manually by using a command like “dps esp” and try to walk further back through the stack on your own.]Anonymous
March 31, 2008
@ David Moisan: GetSource is probably an exported function from NikonScan4.sys; the symbol engine probably resolved to the nearest export, given the (likely) unavailability of a PDB, and the large offset (0x21e93). [Yes, this is correct.]Anonymous
March 31, 2008
Ha transcurrido un mes y medio desde mi último post , ya me vale. Hace unos días, en el blog de soluciónAnonymous
March 31, 2008
.NET:Improving.NetPerformanceWithNgen:TheNativeImageGeneratorDefiningRoutesusingRegul...Anonymous
April 02, 2008
Mike: I have a question, what does the 0x90909090 that is at the bottom of the stack trace mean? I suppose it's an address that the debugger couldn't match with any loaded symbol, but why? 0x90 is the x86 NOP (no-operation) opcode; could be the top frame was a region of NOPs.Anonymous
April 09, 2008
I had a similar problem with my video cards overheating. Dust and lint were cause.Anonymous
May 02, 2008
Hello NTDebuggers, we have been very impressed with the responses we’ve gotten to our previous puzzlersAnonymous
December 01, 2008
Indeed a beautiful post. A great learning especially those who are starters. Good Work shown Chad.. Hoping to see more from you.