A Classic Case of Whodunit

Sometimes we encounter problems that just don't make sense.  I don't mean a high powered lawyer talking about Chewbacca, I mean sometimes computers do things that defy logic.

 

The below bugcheck is one example.  At first glance, some people will blame the first third party code they see and declare "It's the anti-virus!"  That is a classic example of people defying logic, but this article is about computers defying logic, so there must be something else going on here.

3: kd> .bugcheck

Bugcheck code 00000050

Arguments c73fdb0b 00000001 809327b8 00000000

3: kd> k

ChildEBP RetAddr

b7b4d5a8 8085e6cd nt!KeBugCheckEx+0x1b

b7b4d620 8088bc18 nt!MmAccessFault+0xb25

b7b4d620 809327b8 nt!_KiTrap0E+0xdc

b7b4d6bc 808ef973 nt!ObReferenceObjectByHandle+0x16e

b7b4d75c 80888c7c nt!NtQueryInformationFile+0xcd

b7b4d75c 8082ea49 nt!KiFastCallEntry+0xfc

b7b4d7e8 b88db606 nt!ZwQueryInformationFile+0x11

b7b4d864 b88db6c3 NAVAP+0x2e606

b7b4d88c b88b30f6 NAVAP+0x2e6c3

b7b4d8e0 b88b3338 NAVAP+0x60f6

b7b4d900 b88b6a37 NAVAP+0x6338

b7b4d948 b8993348 NAVAP+0x9a37

b7b4d96c b8995af8 SYMEVENT!SYMEvent_GetSubTask+0x1438

b7b4d9e8 b898fe32 SYMEVENT!EventObjectDestroy+0x338

b7b4d9f8 b89963e8 SYMEVENT+0x4e32

b7b4da48 8081dcdf SYMEVENT!EventObjectCreate+0x3e8

b7b4da5c 808f8275 nt!IofCallDriver+0x45

b7b4db44 80936b13 nt!IopParseDevice+0xa35

b7b4dbc4 80932e04 nt!ObpLookupObjectName+0x5a9

b7b4dc18 808ea231 nt!ObOpenObjectByName+0xea

b7b4dc94 808eb4cb nt!IopCreateFile+0x447

b7b4dcf0 808edf4a nt!IoCreateFile+0xa3

b7b4dd30 80888c7c nt!NtCreateFile+0x30

b7b4dd30 7c82ed54 nt!KiFastCallEntry+0xfc

059293e0 00000000 0x7c82ed54

 

The basic premise of all troubleshooting is logic.  I often use a series of questions to shape the logic for the problem I am investigating.  I start all blue screen debugs with the same question, "Why did the system crash?".  The answer to this question is usually in the bugcheck code.

3: kd> .bugcheck

Bugcheck code 00000050

Arguments c73fdb0b 00000001 809327b8 00000000

 

The debugger.chm help file has a description of this error under the topic "Bug Check 0x50: PAGE_FAULT_IN_NONPAGED_AREA".  It explains that this error happens when invalid memory is accessed, and it shows what the four bugcheck parameters mean.

Parameter Description
1 Memory address referenced
2 0: Read operation 1: Write operation
3 Address that referenced memory (if known)
4 reserved

 

Interpreting our bugcheck code, address c73fdb0b was written to by the instruction at address 809327b8

 

If you refer to the documentation on page fault handling in x86 you will see that the processor stores the address being faulted on in cr2 prior to calling the page fault handler.  We can use this to reconfirm the data in the bugcheck code.

3: kd> r cr2

Last set context:

cr2=c73fdb0b

 

We can confirm that virtual address c73fdb0b really is invalid by looking at the PTE.

3: kd> !pte c73fdb0b

                    VA c73fdb0b

PDE at C06031C8            PTE at C0639FE8

contains 000000021BB36963  contains 0000000000000000

pfn 21bb36    -G-DA--KWEV  not valid

 

The next question is why did the instruction at address 809327b8 attempt to write to c73fdb0b?  The call stack and trap frame can answer this question.

3: kd> kv

ChildEBP RetAddr  Args to Child

b7b4d5a8 8085e6cd 00000050 c73fdb0b 00000001 nt!KeBugCheckEx+0x1b (FPO: [Non-Fpo]) (CONV: stdcall)

b7b4d620 8088bc18 00000001 c73fdb0b 00000000 nt!MmAccessFault+0xb25 (FPO: [Non-Fpo]) (CONV: stdcall)

b7b4d620 809327b8 00000001 c73fdb0b 00000000 nt!_KiTrap0E+0xdc (FPO: [0,0] TrapFrame @ b7b4d638)

b7b4d6bc 808ef973 000012e4 00000080 00000180 nt!ObReferenceObjectByHandle+0x16e (FPO: [Non-Fpo]) (CONV: stdcall)

b7b4d75c 80888c7c 000012e4 b7b4d858 b7b4d810 nt!NtQueryInformationFile+0xcd (FPO: [Non-Fpo]) (CONV: stdcall)

b7b4d75c 8082ea49 000012e4 b7b4d858 b7b4d810 nt!KiFastCallEntry+0xfc (FPO: [0,0] TrapFrame @ b7b4d778)

b7b4d7e8 b88db606 000012e4 b7b4d858 b7b4d810 nt!ZwQueryInformationFile+0x11 (FPO: [5,0,0])

WARNING: Stack unwind information not available. Following frames may be wrong.

b7b4d864 b88db6c3 000012e4 e1e5ba38 00000000 NAVAP+0x2e606

b7b4d88c b88b30f6 8b0bff88 b88b3040 b7b4d8fc NAVAP+0x2e6c3

b7b4d8e0 b88b3338 8b0bff88 00003f80 b88b3040 NAVAP+0x60f6

b7b4d900 b88b6a37 00000000 b7b4d9ac b88b6a42 NAVAP+0x6338

b7b4d948 b8993348 e11998b8 b7b4d9a4 00000001 NAVAP+0x9a37

b7b4d96c b8995af8 00000000 b7b4d9ac b8997526 SYMEVENT!SYMEvent_GetSubTask+0x1438

b7b4d9e8 b898fe32 b7b4da2c e162be44 b7b4da2c SYMEVENT!EventObjectDestroy+0x338

b7b4d9f8 b89963e8 b7b4da2c 8b60cc50 b7b4da2c SYMEVENT+0x4e32

b7b4da48 8081dcdf 8b916f10 8b3557c8 8b3557c8 SYMEVENT!EventObjectCreate+0x3e8

b7b4da5c 808f8275 b7b4dc04 8cb7dcb0 00000000 nt!IofCallDriver+0x45 (FPO: [Non-Fpo]) (CONV: fastcall)

b7b4db44 80936b13 8cb7dcc8 00000000 8b452008 nt!IopParseDevice+0xa35 (FPO: [Non-Fpo]) (CONV: stdcall)

b7b4dbc4 80932e04 00000000 b7b4dc04 00000040 nt!ObpLookupObjectName+0x5a9 (FPO: [Non-Fpo]) (CONV: stdcall)

b7b4dc18 808ea231 00000000 00000000 2775a801 nt!ObOpenObjectByName+0xea (FPO: [Non-Fpo]) (CONV: stdcall)

b7b4dc94 808eb4cb 059293e8 80100080 05929384 nt!IopCreateFile+0x447 (FPO: [Non-Fpo]) (CONV: stdcall)

b7b4dcf0 808edf4a 059293e8 80100080 05929384 nt!IoCreateFile+0xa3 (FPO: [Non-Fpo]) (CONV: stdcall)

b7b4dd30 80888c7c 059293e8 80100080 05929384 nt!NtCreateFile+0x30 (FPO: [Non-Fpo]) (CONV: stdcall)

b7b4dd30 7c82ed54 059293e8 80100080 05929384 nt!KiFastCallEntry+0xfc (FPO: [0,0] TrapFrame @ b7b4dd64)

059293e0 00000000 00000000 00000000 00000000 0x7c82ed54

3: kd> .trap b7b4d638

ErrCode = 00000002

eax=00000000 ebx=8b40af78 ecx=00000180 edx=b7b4d6e8 esi=8b81bb40 edi=e5c265c8

eip=809327b8 esp=b7b4d6ac ebp=b7b4d6bc iopl=0         nv up ei ng nz na pe nc

cs=0008  ss=0010  ds=0023  es=0023  fs=0030  gs=0000             efl=00010286

nt!ObReferenceObjectByHandle+0x16e:

809327b8 894a04          mov     dword ptr [edx+4],ecx ds:0023:b7b4d6ec=00000000

 

Reading the instruction that we failed on, at address 809327b8, we can see that it dereferenced edx+4 where edx is b7b4d6e8.  This would result in a write to the memory at address b7b4d6ec.

3: kd> !pte b7b4d6ec

                    VA b7b4d6ec

PDE at C0602DE8            PTE at C05BDA68

contains 000000010BA77863  contains 00000001BB583963

pfn 10ba77    ---DA--KWEV  pfn 1bb583    -G-DA--KWEV

 

3: kd> dd b7b4d6ec

b7b4d6ec  00000000 00000000 00000000 00000000

b7b4d6fc  00000000 b7b4d804 808ef01e 00000098

b7b4d70c  00000005 00000000 8b81bb40 808ef045

b7b4d71c  b7b4d86c 00000000 b7b4d838 b7b4d800

b7b4d72c  00000000 00000000 00000007 00000001

b7b4d73c  00000000 00000000 b7b4d6dc 00000000

b7b4d74c  ffffffff 80880c90 80802b70 ffffffff

b7b4d75c  b7b4d778 80888c7c 000012e4 b7b4d858

 

This is where the logic starts to break down.  The code wanted to write to b7b4d6ec, which is a valid address.  The bugcheck code and cr2 say we failed writing to address c73fdb0b.  This does not make sense.

 

The analogy I often use for scenarios such as this one is: If I ask my intern to get me a Mountain Dew from the break room, and he comes back to say we are out of coffee, am I at fault or is my intern broken?  Applying the same logic to this crash, if ObReferenceObjectByHandle asks the hardware to write to address b7b4d6ec and the hardware came back saying it cannot write to address c73fdb0b, is the software at fault or is the hardware broken?  Clearly the hardware is broken if it does not do what the software asks of it.

 

In this instance, the customer replaced the processor and afterwards the system was stable.

Comments

  • Anonymous
    June 30, 2011
    I've seen this type of nonsensical behavior on overclocked systems. If it was overclocked then what the user has done is probably just replaced one CPU for another that has slightly better margin over its rated speed. [In this scenario the server was produced by a big name in the server hardware business, most likely the BIOS doesn't allow for overclocking.  However, the real cause of hardware malfunctions when overclocking is heat and it is certainly possible that when replacing the processor they also cleaned the dust out of the heat sinks.]

  • Anonymous
    June 30, 2011
    Reat post! It feels real solid, I like your way of color-coding the adresses, which makes it easier to follow. Txs. [Thanks for the feedback.  We'll keep this color coding scheme in mind when editing other posts.]

  • Anonymous
    July 04, 2011
    The comment has been removed

  • Anonymous
    August 24, 2011
    Fantastic example of identifying a hardware fault.  I also agree that the color-coding was helpful in following along. Got to love it when a processor fails.  If you can't trust the CPU, who can you trust? :)

  • Anonymous
    February 22, 2013
    You should look at several crash dumps. If the stack trace is always same, it's likely a driver bug. Different stack trace in every dump is a sign of hardware failure, but can be a software bug (e.g. use after free) too. [There are many possible causes for a bugcheck that changes with each crash.  The most common reason for getting a different stack trace in every dump is memory corruption, where a driver writes to memory it does not own.  This type of corruption is purely a software problem, and changing hardware will not eliminate it.  On the other hand I have seen hardware malfunctions that manifest themselves the same way each time, and generate a nearly identical stack in every crash.]