What Should Never Happen... Did
Hi, this is Bob Golding; I wanted to write a blog about an interesting hardware issue I ran into. Hardware problems can be tricky to isolate. I recently came across one that I thought was interesting and gave an example of how to trace code execution. The machine executed the filler “int 3” instructions generated by the compiler. Execution should never reach these filler instructions, so we needed to determine how the instruction pointer got there.
What was the issue?
The issue was a bug check 8E (unhandled exception). The exception was a debug exception (80000003), because a filler INT 3 instruction was executed.
1: kd> .bugcheck
Bugcheck code 0000008E
Arguments 80000003 8082e3e0 f78aec38 00000000
Below is the trap frame (the trap frame is the third argument in the bugcheck code). Note that the actual trapping instruction is 8082e3e0, the instruction pointer is incremented before the INT 3 generates a trap. The correct EIP is reported in the bug check values.
1: kd> .trap f78aec38
ErrCode = 00000000
eax=0e364a01 ebx=f78aed50 ecx=f78aecb8 edx=000000e1 esi=f7727ec0 edi=f75d9ca2
eip=8082e3e1 esp=f78aecac ebp=f78aecc0 iopl=0 nv up ei pl nz na po nc
cs=0008 ss=0010 ds=0023 es=0023 fs=0030 gs=0000 efl=00000202
nt!KiCheckForKernelApcDelivery+0x37:
8082e3e1 cc int 3
The machine executed code at the end of a function that should never be executed. This should never happen.
1: kd> ub 8082e3e1
nt!KiCheckForKernelApcDelivery+0x1c:
8082e3c6 32c9 xor cl,cl
8082e3c8 ff2508118080 jmp dword ptr [nt!_imp_KfLowerIrql (80801108)]
8082e3ce 64a124010000 mov eax,dword ptr fs:[00000124h]
8082e3d4 c6403d01 mov byte ptr [eax+3Dh],1
8082e3d8 ff2598108080 jmp dword ptr [nt!_imp_HalRequestSoftwareInterrupt (80801098)]
8082e3de cc int 3 <<< This is after the function.
8082e3df cc int 3
8082e3e0 cc int 3
OK, So Now What?
Now, we need to find the execution path that caused the machine to execute this INT 3. There are places to look to find clues that will tell us. The first place to start looking is the stack. If a “call” instruction was made, the return will be pushed on the stack. This way we can try to determine if we arrived at this bad instruction pointer from a call or a ret instruction.
Using the value from esp in the above trap frame, let’s dump the stack.
1: kd> dps f78aecac
f78aecac 80a5d8fc hal!HalpIpiHandler+0xcc <<< Interesting?
f78aecb0 f78aecc0
f78aecb4 00000000
f78aecb8 00000002
f78aecbc 000000e1
f78aecc0 f78aed50
f78aecc4 f75d9ca2
f78aecc8 badb0d00
f78aeccc 000086a8
In looking at the stack dump, we see that there may have been a call from HalpIpiHandler. Let’s dump the code leading up to hal!HalpIpiHandler+0xcc to see what it did.
1: kd> ub 80a5d8fc
hal!HalpIpiHandler+0xb1:
80a5d8e1 e8d82b0000 call hal!HalBeginSystemInterrupt (80a604be)
80a5d8e6 0ac0 or al,al
80a5d8e8 7509 jne hal!HalpIpiHandler+0xc3 (80a5d8f3)
80a5d8ea 83c408 add esp,8
80a5d8ed ff2510b0a580 jmp dword ptr [hal!_imp_Kei386EoiHelper (80a5b010)]
80a5d8f3 6a00 push 0
80a5d8f5 55 push ebp
80a5d8f6 ff1520b0a580 call dword ptr [hal!_imp__KiIpiServiceRoutine (80a5b020)]
In the above assembly, we can see that there is a call made using a pointer in the import table. Now, let’s have a look at that pointer.
1: kd> dd 80a5b020 l 1
80a5b020 8082e3e4
The pointer is very close to the instruction we trapped on. Is this a coincidence?
It looks like due to an effective address calculation failure, the machine starting executing at 8082e3e0 instead of 8082e3e4. Somewhere in the data path the processor executing this instruction stream dropped bit three, turning a 4 into a 0.
1: kd> ?0y0100
Evaluate expression: 4 = 00000000`00000004
1: kd> ?0y0000
Evaluate expression: 0 = 00000000`00000000
What does all of this mean?
There is some circumstantial evidence here that the machine was in the IPI handler. The IPI Handler is used in multiprocessor systems so that one processor may interrupt another. So, how can we further prove this is where we were? Let’s try to match the trap frame registers to the assembly from HalpIpiHandler before it calls KiIpiServiceRoutine.
1: kd> .trap f78aec38
ErrCode = 00000000
eax=0e364a01 ebx=f78aed50 ecx=f78aecb8 edx=000000e1 esi=f7727ec0 edi=f75d9ca2
eip=8082e3e1 esp=f78aecac ebp=f78aecc0 iopl=0 nv up ei pl nz na po nc
cs=0008 ss=0010 ds=0023 es=0023 fs=0030 gs=0000 efl=00000202
nt!KiCheckForKernelApcDelivery+0x37:
8082e3e1 cc int 3
1: kd> ub 80a5d8fc l10
hal!HalpIpiHandler+0x94:
80a5d8c4 c74508000ddbba mov dword ptr [ebp+8],0BADB0D00h <<< EBP+ 8 does equal badb0d00
80a5d8cb 895d00 mov dword ptr [ebp],ebx <<< ebp does equal ebx
80a5d8ce 897d04 mov dword ptr [ebp+4],edi <<< ebp+4 does equal edi
80a5d8d1 68e1000000 push 0E1h
80a5d8d6 83ec04 sub esp,4
80a5d8d9 54 push esp
80a5d8da 68e1000000 push 0E1h
80a5d8df 6a1d push 1Dh
80a5d8e1 e8d82b0000 call hal!HalBeginSystemInterrupt (80a604be)
80a5d8e6 0ac0 or al,al <<< al is equal to 1
80a5d8e8 7509 jne hal!HalpIpiHandler+0xc3 (80a5d8f3)
80a5d8ea 83c408 add esp,8
80a5d8ed ff2510b0a580 jmp dword ptr [hal!_imp_Kei386EoiHelper (80a5b010)]
80a5d8f3 6a00 push 0
80a5d8f5 55 push ebp
80a5d8f6 ff1520b0a580 call dword ptr [hal!_imp__KiIpiServiceRoutine (80a5b020)]
Below is the stack data that the instructions above produced (borrowed from the dps output above). It matches, so the machine did execute the above instructions.
f78aecac 80a5d8fc hal!HalpIpiHandler+0xcc <<< Return Pushed by call instruction
f78aecb0 f78aecc0 <<< Pushed EBP
f78aecb4 00000000 <<< Pushed a 0
Finally, the last bit of evidence:
Let's view the IPI state using the debugger command !ipi. From this output we can see that processor 1 is the receiver of a cross interrupt from Processor 0. This is consistent with the data we found on the stack.
1: kd> !ipi
IPI State for Processor 0
As a sender, awaiting packet completion from processor 1.
TargetSet 2 PacketBarrier e IpiFrozen 2 [Frozen]
IpiFrame 8089a570 SignalDone 00000000 RequestSummary 0
Packet State Active WorkerRoutine nt!KiFlushTargetMultipleTb
Parameter[0] 00000000 Parameter[1] 80899f10 Parameter[2] 80899f04
IPI State for Processor 1
As a receiver, the following unhandled requests are pending: [Packet] [DPC]
TargetSet 0 PacketBarrier 0 IpiFrozen 0 [Running]
IpiFrame b5ad7be8 SignalDone ffdff120 RequestSummary 2 [DPC]
Packet State Stale WorkerRoutine nt!KiFlushTargetMultipleTb
Parameter[0] 00000000 Parameter[1] b5ad7950 Parameter[2] b5ad7948
What went wrong?
Based on the evidence in this dump it appears that call instruction transferred execution to the wrong address. The machine ended up executing at address 8082e3e0 instead of 8082e3e4, a single bit difference. This same bit was flipped in several crashes from this machine, so all the evidence pointed to a faulty processor. After replacing the processor we were running on when we bugchecked, the issue did not occur again.
Hardware can sometimes cause pretty specific failures, such as the flipped bit we see here. To determine that this failure was a hardware issue, we had to reconstruct the execution path and trace how we ended up at the failing instruction. We were able to match the register contents to what they would have been before the call to KiIpiServiceRoutine. This demonstrated that the call should have been made to KiIpiServiceRoutine, but it unexpectedly went to the wrong address.