A Classic Case of Whodunit
Sometimes we encounter problems that just don't make sense. I don't mean a high-powered lawyer talking about Chewbacca; I mean that sometimes computers do things that defy logic.
The bugcheck below is one example. At first glance, some people will blame the first third-party code they see and declare, "It's the anti-virus!" That is a classic example of people defying logic, but this article is about computers defying logic, so something else must be going on here.
3: kd> .bugcheck
Bugcheck code 00000050
Arguments c73fdb0b 00000001 809327b8 00000000
3: kd> k
ChildEBP RetAddr
b7b4d5a8 8085e6cd nt!KeBugCheckEx+0x1b
b7b4d620 8088bc18 nt!MmAccessFault+0xb25
b7b4d620 809327b8 nt!_KiTrap0E+0xdc
b7b4d6bc 808ef973 nt!ObReferenceObjectByHandle+0x16e
b7b4d75c 80888c7c nt!NtQueryInformationFile+0xcd
b7b4d75c 8082ea49 nt!KiFastCallEntry+0xfc
b7b4d7e8 b88db606 nt!ZwQueryInformationFile+0x11
b7b4d864 b88db6c3 NAVAP+0x2e606
b7b4d88c b88b30f6 NAVAP+0x2e6c3
b7b4d8e0 b88b3338 NAVAP+0x60f6
b7b4d900 b88b6a37 NAVAP+0x6338
b7b4d948 b8993348 NAVAP+0x9a37
b7b4d96c b8995af8 SYMEVENT!SYMEvent_GetSubTask+0x1438
b7b4d9e8 b898fe32 SYMEVENT!EventObjectDestroy+0x338
b7b4d9f8 b89963e8 SYMEVENT+0x4e32
b7b4da48 8081dcdf SYMEVENT!EventObjectCreate+0x3e8
b7b4da5c 808f8275 nt!IofCallDriver+0x45
b7b4db44 80936b13 nt!IopParseDevice+0xa35
b7b4dbc4 80932e04 nt!ObpLookupObjectName+0x5a9
b7b4dc18 808ea231 nt!ObOpenObjectByName+0xea
b7b4dc94 808eb4cb nt!IopCreateFile+0x447
b7b4dcf0 808edf4a nt!IoCreateFile+0xa3
b7b4dd30 80888c7c nt!NtCreateFile+0x30
b7b4dd30 7c82ed54 nt!KiFastCallEntry+0xfc
059293e0 00000000 0x7c82ed54
The basic premise of all troubleshooting is logic. I often use a series of questions to shape the logic for the problem I am investigating. I start every blue screen debug with the same question: "Why did the system crash?" The answer to this question is usually in the bugcheck code.
3: kd> .bugcheck
Bugcheck code 00000050
Arguments c73fdb0b 00000001 809327b8 00000000
The debugger.chm help file has a description of this error under the topic "Bug Check 0x50: PAGE_FAULT_IN_NONPAGED_AREA". It explains that this error happens when invalid memory is accessed, and it shows what the four bugcheck parameters mean.
| Parameter | Description |
| --- | --- |
| 1 | Memory address referenced |
| 2 | 0: Read operation, 1: Write operation |
| 3 | Address that referenced memory (if known) |
| 4 | Reserved |
Interpreting our bugcheck parameters: the instruction at address 809327b8 wrote to address c73fdb0b.
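The mechanical decoding of those four parameters can be sketched in a few lines. This is a hypothetical helper for illustration, not part of any Microsoft tooling:

```python
# Hypothetical helper: decode the four Bug Check 0x50 parameters
# exactly as the debugger.chm table describes them.
def decode_bugcheck_0x50(args):
    addr, op, source, _reserved = args
    return {
        "faulting_address": hex(addr),
        "operation": "write" if op == 1 else "read",
        "referencing_instruction": hex(source),
    }

info = decode_bugcheck_0x50([0xC73FDB0B, 0x00000001, 0x809327B8, 0x00000000])
print(info["operation"], info["faulting_address"], info["referencing_instruction"])
# write 0xc73fdb0b 0x809327b8
```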
If you refer to the x86 documentation on page fault handling, you will see that the processor stores the faulting address in cr2 before invoking the page fault handler. We can use this to confirm the data in the bugcheck code.
3: kd> r cr2
Last set context:
cr2=c73fdb0b
We can confirm that virtual address c73fdb0b really is invalid by looking at the PTE.
3: kd> !pte c73fdb0b
VA c73fdb0b
PDE at C06031C8 PTE at C0639FE8
contains 000000021BB36963 contains 0000000000000000
pfn 21bb36 -G-DA--KWEV not valid
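For readers curious how !pte finds those entry addresses: on 32-bit Windows with PAE, the page tables are self-mapped into kernel address space, so the virtual address of each paging entry can be computed directly from the VA being translated. A sketch, assuming the conventional x86 PAE self-map bases of c0000000 (PTEs) and c0600000 (PDEs) and 8-byte entries:

```python
# Assumed self-map bases for 32-bit Windows with PAE (illustrative sketch).
PTE_BASE = 0xC0000000
PDE_BASE = 0xC0600000

def pte_address(va):
    return PTE_BASE + ((va >> 12) << 3)   # one 8-byte PTE per 4 KB page

def pde_address(va):
    return PDE_BASE + ((va >> 21) << 3)   # one 8-byte PDE per 2 MB region

va = 0xC73FDB0B
print(hex(pde_address(va)))   # 0xc06031c8 -> matches "PDE at C06031C8"
print(hex(pte_address(va)))   # 0xc0639fe8 -> matches "PTE at C0639FE8"

# Bit 0 of a paging entry is the valid (present) bit.
pde_contents = 0x000000021BB36963
pte_contents = 0x0000000000000000
print(pde_contents & 1)       # 1 -> PDE is valid
print(pte_contents & 1)       # 0 -> PTE is not valid, so c73fdb0b is invalid
```

The computed entry addresses and the valid bits line up with what !pte reported above.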
The next question is: why did the instruction at address 809327b8 attempt to write to c73fdb0b? The call stack and trap frame can answer this question.
3: kd> kv
ChildEBP RetAddr Args to Child
b7b4d5a8 8085e6cd 00000050 c73fdb0b 00000001 nt!KeBugCheckEx+0x1b (FPO: [Non-Fpo]) (CONV: stdcall)
b7b4d620 8088bc18 00000001 c73fdb0b 00000000 nt!MmAccessFault+0xb25 (FPO: [Non-Fpo]) (CONV: stdcall)
b7b4d620 809327b8 00000001 c73fdb0b 00000000 nt!_KiTrap0E+0xdc (FPO: [0,0] TrapFrame @ b7b4d638)
b7b4d6bc 808ef973 000012e4 00000080 00000180 nt!ObReferenceObjectByHandle+0x16e (FPO: [Non-Fpo]) (CONV: stdcall)
b7b4d75c 80888c7c 000012e4 b7b4d858 b7b4d810 nt!NtQueryInformationFile+0xcd (FPO: [Non-Fpo]) (CONV: stdcall)
b7b4d75c 8082ea49 000012e4 b7b4d858 b7b4d810 nt!KiFastCallEntry+0xfc (FPO: [0,0] TrapFrame @ b7b4d778)
b7b4d7e8 b88db606 000012e4 b7b4d858 b7b4d810 nt!ZwQueryInformationFile+0x11 (FPO: [5,0,0])
WARNING: Stack unwind information not available. Following frames may be wrong.
b7b4d864 b88db6c3 000012e4 e1e5ba38 00000000 NAVAP+0x2e606
b7b4d88c b88b30f6 8b0bff88 b88b3040 b7b4d8fc NAVAP+0x2e6c3
b7b4d8e0 b88b3338 8b0bff88 00003f80 b88b3040 NAVAP+0x60f6
b7b4d900 b88b6a37 00000000 b7b4d9ac b88b6a42 NAVAP+0x6338
b7b4d948 b8993348 e11998b8 b7b4d9a4 00000001 NAVAP+0x9a37
b7b4d96c b8995af8 00000000 b7b4d9ac b8997526 SYMEVENT!SYMEvent_GetSubTask+0x1438
b7b4d9e8 b898fe32 b7b4da2c e162be44 b7b4da2c SYMEVENT!EventObjectDestroy+0x338
b7b4d9f8 b89963e8 b7b4da2c 8b60cc50 b7b4da2c SYMEVENT+0x4e32
b7b4da48 8081dcdf 8b916f10 8b3557c8 8b3557c8 SYMEVENT!EventObjectCreate+0x3e8
b7b4da5c 808f8275 b7b4dc04 8cb7dcb0 00000000 nt!IofCallDriver+0x45 (FPO: [Non-Fpo]) (CONV: fastcall)
b7b4db44 80936b13 8cb7dcc8 00000000 8b452008 nt!IopParseDevice+0xa35 (FPO: [Non-Fpo]) (CONV: stdcall)
b7b4dbc4 80932e04 00000000 b7b4dc04 00000040 nt!ObpLookupObjectName+0x5a9 (FPO: [Non-Fpo]) (CONV: stdcall)
b7b4dc18 808ea231 00000000 00000000 2775a801 nt!ObOpenObjectByName+0xea (FPO: [Non-Fpo]) (CONV: stdcall)
b7b4dc94 808eb4cb 059293e8 80100080 05929384 nt!IopCreateFile+0x447 (FPO: [Non-Fpo]) (CONV: stdcall)
b7b4dcf0 808edf4a 059293e8 80100080 05929384 nt!IoCreateFile+0xa3 (FPO: [Non-Fpo]) (CONV: stdcall)
b7b4dd30 80888c7c 059293e8 80100080 05929384 nt!NtCreateFile+0x30 (FPO: [Non-Fpo]) (CONV: stdcall)
b7b4dd30 7c82ed54 059293e8 80100080 05929384 nt!KiFastCallEntry+0xfc (FPO: [0,0] TrapFrame @ b7b4dd64)
059293e0 00000000 00000000 00000000 00000000 0x7c82ed54
3: kd> .trap b7b4d638
ErrCode = 00000002
eax=00000000 ebx=8b40af78 ecx=00000180 edx=b7b4d6e8 esi=8b81bb40 edi=e5c265c8
eip=809327b8 esp=b7b4d6ac ebp=b7b4d6bc iopl=0 nv up ei ng nz na pe nc
cs=0008 ss=0010 ds=0023 es=0023 fs=0030 gs=0000 efl=00010286
nt!ObReferenceObjectByHandle+0x16e:
809327b8 894a04 mov dword ptr [edx+4],ecx ds:0023:b7b4d6ec=00000000
Reading the instruction we failed on, at address 809327b8, we can see that it wrote to [edx+4], where edx is b7b4d6e8. This is a write to the memory at address b7b4d6ec.
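As a cross-check that does not rely on the debugger's disassembly, the instruction bytes 89 4a 04 can be decoded by hand using the standard x86 encoding rules (a sketch):

```python
# Decode the faulting instruction bytes 89 4a 04 by hand.
code = bytes([0x89, 0x4A, 0x04])
opcode, modrm, disp8 = code

assert opcode == 0x89        # MOV r/m32, r32
mod = modrm >> 6             # 0b01 -> memory operand with 8-bit displacement
reg = (modrm >> 3) & 7       # 0b001 -> ecx (the source register)
rm  = modrm & 7              # 0b010 -> edx (the base register)

edx = 0xB7B4D6E8             # from the trap frame
effective_address = edx + disp8
print(hex(effective_address))  # 0xb7b4d6ec -> the address the code wrote to
```

Either way, the target of the write is b7b4d6ec, not c73fdb0b.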
3: kd> !pte b7b4d6ec
VA b7b4d6ec
PDE at C0602DE8 PTE at C05BDA68
contains 000000010BA77863 contains 00000001BB583963
pfn 10ba77 ---DA--KWEV pfn 1bb583 -G-DA--KWEV
3: kd> dd b7b4d6ec
b7b4d6ec 00000000 00000000 00000000 00000000
b7b4d6fc 00000000 b7b4d804 808ef01e 00000098
b7b4d70c 00000005 00000000 8b81bb40 808ef045
b7b4d71c b7b4d86c 00000000 b7b4d838 b7b4d800
b7b4d72c 00000000 00000000 00000007 00000001
b7b4d73c 00000000 00000000 b7b4d6dc 00000000
b7b4d74c ffffffff 80880c90 80802b70 ffffffff
b7b4d75c b7b4d778 80888c7c 000012e4 b7b4d858
This is where the logic starts to break down. The code wanted to write to b7b4d6ec, which is a valid address. The bugcheck code and cr2 say we failed writing to address c73fdb0b. This does not make sense.
The analogy I often use for scenarios such as this one is: If I ask my intern to get me a Mountain Dew from the break room, and he comes back to say we are out of coffee, am I at fault or is my intern broken? Applying the same logic to this crash, if ObReferenceObjectByHandle asks the hardware to write to address b7b4d6ec and the hardware came back saying it cannot write to address c73fdb0b, is the software at fault or is the hardware broken? Clearly the hardware is broken if it does not do what the software asks of it.
In this instance, the customer replaced the processor and afterwards the system was stable.
Comments
Anonymous
June 30, 2011
I've seen this type of nonsensical behavior on overclocked systems. If it was overclocked, then the user has probably just replaced one CPU with another that has slightly better margin over its rated speed. [In this scenario the server was produced by a big name in the server hardware business, so the BIOS most likely does not allow overclocking. However, the real cause of hardware malfunctions when overclocking is heat, and it is certainly possible that when replacing the processor they also cleaned the dust out of the heat sinks.]
Anonymous
June 30, 2011
Great post! It feels really solid. I like your way of color-coding the addresses, which makes it easier to follow. Thanks. [Thanks for the feedback. We'll keep this color coding scheme in mind when editing other posts.]
Anonymous
July 04, 2011
The comment has been removed
Anonymous
August 24, 2011
Fantastic example of identifying a hardware fault. I also agree that the color-coding was helpful in following along. Got to love it when a processor fails. If you can't trust the CPU, who can you trust? :)
Anonymous
February 22, 2013
You should look at several crash dumps. If the stack trace is always the same, it's likely a driver bug. A different stack trace in every dump is a sign of hardware failure, but it can be a software bug (e.g. a use-after-free) too. [There are many possible causes for a bugcheck that changes with each crash. The most common reason for getting a different stack trace in every dump is memory corruption, where a driver writes to memory it does not own. This type of corruption is purely a software problem, and changing hardware will not eliminate it. On the other hand, I have seen hardware malfunctions that manifest themselves the same way each time and generate a nearly identical stack in every crash.]