A Classic Case of Whodunit
Sometimes we encounter problems that just don't make sense. I don't mean a high-powered lawyer talking about Chewbacca; I mean that sometimes computers do things that defy logic.
The bugcheck below is one example. At first glance, some people will blame the first third-party code they see and declare, "It's the anti-virus!" That is a classic example of people defying logic, but this article is about computers defying logic, so something else must be going on here.
3: kd> .bugcheck
Bugcheck code 00000050
Arguments c73fdb0b 00000001 809327b8 00000000
3: kd> k
ChildEBP RetAddr
b7b4d5a8 8085e6cd nt!KeBugCheckEx+0x1b
b7b4d620 8088bc18 nt!MmAccessFault+0xb25
b7b4d620 809327b8 nt!_KiTrap0E+0xdc
b7b4d6bc 808ef973 nt!ObReferenceObjectByHandle+0x16e
b7b4d75c 80888c7c nt!NtQueryInformationFile+0xcd
b7b4d75c 8082ea49 nt!KiFastCallEntry+0xfc
b7b4d7e8 b88db606 nt!ZwQueryInformationFile+0x11
b7b4d864 b88db6c3 NAVAP+0x2e606
b7b4d88c b88b30f6 NAVAP+0x2e6c3
b7b4d8e0 b88b3338 NAVAP+0x60f6
b7b4d900 b88b6a37 NAVAP+0x6338
b7b4d948 b8993348 NAVAP+0x9a37
b7b4d96c b8995af8 SYMEVENT!SYMEvent_GetSubTask+0x1438
b7b4d9e8 b898fe32 SYMEVENT!EventObjectDestroy+0x338
b7b4d9f8 b89963e8 SYMEVENT+0x4e32
b7b4da48 8081dcdf SYMEVENT!EventObjectCreate+0x3e8
b7b4da5c 808f8275 nt!IofCallDriver+0x45
b7b4db44 80936b13 nt!IopParseDevice+0xa35
b7b4dbc4 80932e04 nt!ObpLookupObjectName+0x5a9
b7b4dc18 808ea231 nt!ObOpenObjectByName+0xea
b7b4dc94 808eb4cb nt!IopCreateFile+0x447
b7b4dcf0 808edf4a nt!IoCreateFile+0xa3
b7b4dd30 80888c7c nt!NtCreateFile+0x30
b7b4dd30 7c82ed54 nt!KiFastCallEntry+0xfc
059293e0 00000000 0x7c82ed54
The basic premise of all troubleshooting is logic. I often use a series of questions to shape the logic for the problem I am investigating. I start every blue screen debug with the same question: "Why did the system crash?" The answer to this question is usually in the bugcheck code.
3: kd> .bugcheck
Bugcheck code 00000050
Arguments c73fdb0b 00000001 809327b8 00000000
The debugger.chm help file has a description of this error under the topic "Bug Check 0x50: PAGE_FAULT_IN_NONPAGED_AREA". It explains that this error happens when invalid memory is accessed, and it shows what the four bugcheck parameters mean.
| Parameter | Description |
| --- | --- |
| 1 | Memory address referenced |
| 2 | 0: Read operation, 1: Write operation |
| 3 | Address that referenced memory (if known) |
| 4 | Reserved |
Interpreting our bugcheck parameters: the instruction at address 809327b8 wrote to address c73fdb0b.
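The mechanical decoding of those four parameters can be sketched in a few lines. This is a hypothetical helper for illustration, not part of any Microsoft tooling:

```python
# Hypothetical helper: decode the four Bug Check 0x50 parameters
# exactly as the debugger.chm table describes them.
def decode_bugcheck_0x50(args):
    addr, op, source, _reserved = args
    return {
        "faulting_address": hex(addr),
        "operation": "write" if op == 1 else "read",
        "referencing_instruction": hex(source),
    }

info = decode_bugcheck_0x50([0xC73FDB0B, 0x00000001, 0x809327B8, 0x00000000])
print(info["operation"], info["faulting_address"], info["referencing_instruction"])
# write 0xc73fdb0b 0x809327b8
```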
If you refer to the x86 documentation on page fault handling, you will see that the processor stores the faulting address in cr2 before invoking the page fault handler. We can use this to confirm the data in the bugcheck code.
3: kd> r cr2
Last set context:
cr2=c73fdb0b
We can confirm that virtual address c73fdb0b really is invalid by looking at the PTE.
3: kd> !pte c73fdb0b
VA c73fdb0b
PDE at C06031C8 PTE at C0639FE8
contains 000000021BB36963 contains 0000000000000000
pfn 21bb36 -G-DA--KWEV not valid
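For readers curious how !pte finds those entry addresses: on 32-bit Windows with PAE, the page tables are self-mapped into kernel address space, so the virtual address of each paging entry can be computed directly from the VA being translated. A sketch, assuming the conventional x86 PAE self-map bases of c0000000 (PTEs) and c0600000 (PDEs) and 8-byte entries:

```python
# Assumed self-map bases for 32-bit Windows with PAE (illustrative sketch).
PTE_BASE = 0xC0000000
PDE_BASE = 0xC0600000

def pte_address(va):
    return PTE_BASE + ((va >> 12) << 3)   # one 8-byte PTE per 4 KB page

def pde_address(va):
    return PDE_BASE + ((va >> 21) << 3)   # one 8-byte PDE per 2 MB region

va = 0xC73FDB0B
print(hex(pde_address(va)))   # 0xc06031c8 -> matches "PDE at C06031C8"
print(hex(pte_address(va)))   # 0xc0639fe8 -> matches "PTE at C0639FE8"

# Bit 0 of a paging entry is the valid (present) bit.
pde_contents = 0x000000021BB36963
pte_contents = 0x0000000000000000
print(pde_contents & 1)       # 1 -> PDE is valid
print(pte_contents & 1)       # 0 -> PTE is not valid, so c73fdb0b is invalid
```

The computed entry addresses and the valid bits line up with what !pte reported above.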
The next question is: why did the instruction at address 809327b8 attempt to write to c73fdb0b? The call stack and trap frame can answer this question.
3: kd> kv
ChildEBP RetAddr Args to Child
b7b4d5a8 8085e6cd 00000050 c73fdb0b 00000001 nt!KeBugCheckEx+0x1b (FPO: [Non-Fpo]) (CONV: stdcall)
b7b4d620 8088bc18 00000001 c73fdb0b 00000000 nt!MmAccessFault+0xb25 (FPO: [Non-Fpo]) (CONV: stdcall)
b7b4d620 809327b8 00000001 c73fdb0b 00000000 nt!_KiTrap0E+0xdc (FPO: [0,0] TrapFrame @ b7b4d638)
b7b4d6bc 808ef973 000012e4 00000080 00000180 nt!ObReferenceObjectByHandle+0x16e (FPO: [Non-Fpo]) (CONV: stdcall)
b7b4d75c 80888c7c 000012e4 b7b4d858 b7b4d810 nt!NtQueryInformationFile+0xcd (FPO: [Non-Fpo]) (CONV: stdcall)
b7b4d75c 8082ea49 000012e4 b7b4d858 b7b4d810 nt!KiFastCallEntry+0xfc (FPO: [0,0] TrapFrame @ b7b4d778)
b7b4d7e8 b88db606 000012e4 b7b4d858 b7b4d810 nt!ZwQueryInformationFile+0x11 (FPO: [5,0,0])
WARNING: Stack unwind information not available. Following frames may be wrong.
b7b4d864 b88db6c3 000012e4 e1e5ba38 00000000 NAVAP+0x2e606
b7b4d88c b88b30f6 8b0bff88 b88b3040 b7b4d8fc NAVAP+0x2e6c3
b7b4d8e0 b88b3338 8b0bff88 00003f80 b88b3040 NAVAP+0x60f6
b7b4d900 b88b6a37 00000000 b7b4d9ac b88b6a42 NAVAP+0x6338
b7b4d948 b8993348 e11998b8 b7b4d9a4 00000001 NAVAP+0x9a37
b7b4d96c b8995af8 00000000 b7b4d9ac b8997526 SYMEVENT!SYMEvent_GetSubTask+0x1438
b7b4d9e8 b898fe32 b7b4da2c e162be44 b7b4da2c SYMEVENT!EventObjectDestroy+0x338
b7b4d9f8 b89963e8 b7b4da2c 8b60cc50 b7b4da2c SYMEVENT+0x4e32
b7b4da48 8081dcdf 8b916f10 8b3557c8 8b3557c8 SYMEVENT!EventObjectCreate+0x3e8
b7b4da5c 808f8275 b7b4dc04 8cb7dcb0 00000000 nt!IofCallDriver+0x45 (FPO: [Non-Fpo]) (CONV: fastcall)
b7b4db44 80936b13 8cb7dcc8 00000000 8b452008 nt!IopParseDevice+0xa35 (FPO: [Non-Fpo]) (CONV: stdcall)
b7b4dbc4 80932e04 00000000 b7b4dc04 00000040 nt!ObpLookupObjectName+0x5a9 (FPO: [Non-Fpo]) (CONV: stdcall)
b7b4dc18 808ea231 00000000 00000000 2775a801 nt!ObOpenObjectByName+0xea (FPO: [Non-Fpo]) (CONV: stdcall)
b7b4dc94 808eb4cb 059293e8 80100080 05929384 nt!IopCreateFile+0x447 (FPO: [Non-Fpo]) (CONV: stdcall)
b7b4dcf0 808edf4a 059293e8 80100080 05929384 nt!IoCreateFile+0xa3 (FPO: [Non-Fpo]) (CONV: stdcall)
b7b4dd30 80888c7c 059293e8 80100080 05929384 nt!NtCreateFile+0x30 (FPO: [Non-Fpo]) (CONV: stdcall)
b7b4dd30 7c82ed54 059293e8 80100080 05929384 nt!KiFastCallEntry+0xfc (FPO: [0,0] TrapFrame @ b7b4dd64)
059293e0 00000000 00000000 00000000 00000000 0x7c82ed54
3: kd> .trap b7b4d638
ErrCode = 00000002
eax=00000000 ebx=8b40af78 ecx=00000180 edx=b7b4d6e8 esi=8b81bb40 edi=e5c265c8
eip=809327b8 esp=b7b4d6ac ebp=b7b4d6bc iopl=0 nv up ei ng nz na pe nc
cs=0008 ss=0010 ds=0023 es=0023 fs=0030 gs=0000 efl=00010286
nt!ObReferenceObjectByHandle+0x16e:
809327b8 894a04 mov dword ptr [edx+4],ecx ds:0023:b7b4d6ec=00000000
Reading the instruction we failed on, at address 809327b8, we can see that it wrote to [edx+4], where edx is b7b4d6e8. This is a write to the memory at address b7b4d6ec.
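As a cross-check that does not rely on the debugger's disassembly, the instruction bytes 89 4a 04 can be decoded by hand using the standard x86 encoding rules (a sketch):

```python
# Decode the faulting instruction bytes 89 4a 04 by hand.
code = bytes([0x89, 0x4A, 0x04])
opcode, modrm, disp8 = code

assert opcode == 0x89        # MOV r/m32, r32
mod = modrm >> 6             # 0b01 -> memory operand with 8-bit displacement
reg = (modrm >> 3) & 7       # 0b001 -> ecx (the source register)
rm  = modrm & 7              # 0b010 -> edx (the base register)

edx = 0xB7B4D6E8             # from the trap frame
effective_address = edx + disp8
print(hex(effective_address))  # 0xb7b4d6ec -> the address the code wrote to
```

Either way, the target of the write is b7b4d6ec, not c73fdb0b.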
3: kd> !pte b7b4d6ec
VA b7b4d6ec
PDE at C0602DE8 PTE at C05BDA68
contains 000000010BA77863 contains 00000001BB583963
pfn 10ba77 ---DA--KWEV pfn 1bb583 -G-DA--KWEV
3: kd> dd b7b4d6ec
b7b4d6ec 00000000 00000000 00000000 00000000
b7b4d6fc 00000000 b7b4d804 808ef01e 00000098
b7b4d70c 00000005 00000000 8b81bb40 808ef045
b7b4d71c b7b4d86c 00000000 b7b4d838 b7b4d800
b7b4d72c 00000000 00000000 00000007 00000001
b7b4d73c 00000000 00000000 b7b4d6dc 00000000
b7b4d74c ffffffff 80880c90 80802b70 ffffffff
b7b4d75c b7b4d778 80888c7c 000012e4 b7b4d858
This is where the logic starts to break down. The code wanted to write to b7b4d6ec, which is a valid address. The bugcheck code and cr2 say we failed writing to address c73fdb0b. This does not make sense.
The analogy I often use for scenarios such as this one is: If I ask my intern to get me a Mountain Dew from the break room, and he comes back to say we are out of coffee, am I at fault or is my intern broken? Applying the same logic to this crash, if ObReferenceObjectByHandle asks the hardware to write to address b7b4d6ec and the hardware came back saying it cannot write to address c73fdb0b, is the software at fault or is the hardware broken? Clearly the hardware is broken if it does not do what the software asks of it.
In this instance, the customer replaced the processor and afterwards the system was stable.
Comments
Anonymous
June 30, 2011
I've seen this type of nonsensical behavior on overclocked systems. If it was overclocked, then the user has probably just replaced one CPU with another that has slightly better margin over its rated speed. [In this scenario the server was produced by a big name in the server hardware business, so the BIOS most likely does not allow overclocking. However, the real cause of hardware malfunctions when overclocking is heat, and it is certainly possible that when replacing the processor they also cleaned the dust out of the heat sinks.]
Anonymous
June 30, 2011
Great post! It feels really solid. I like your way of color-coding the addresses, which makes it easier to follow. Thanks. [Thanks for the feedback. We'll keep this color coding scheme in mind when editing other posts.]
Anonymous
July 04, 2011
The comment has been removed
Anonymous
August 24, 2011
Fantastic example of identifying a hardware fault. I also agree that the color-coding was helpful in following along. Got to love it when a processor fails. If you can't trust the CPU, who can you trust? :)
Anonymous
February 22, 2013
You should look at several crash dumps. If the stack trace is always the same, it's likely a driver bug. A different stack trace in every dump is a sign of hardware failure, but it can be a software bug (e.g. a use-after-free) too. [There are many possible causes for a bugcheck that changes with each crash. The most common reason for getting a different stack trace in every dump is memory corruption, where a driver writes to memory it does not own. This type of corruption is purely a software problem, and changing hardware will not eliminate it. On the other hand, I have seen hardware malfunctions that manifest themselves the same way each time and generate a nearly identical stack in every crash.]