Smoking Gun Pool Corruption
Hello, my name is Ron Stock and I’m an Escalation Engineer on the Microsoft Platforms Global Escalation Services Team. Today I’m going to talk about pool corruption, which manifests itself in various ways. It’s usually hard to track down because the culprit is long gone by the time the machine crashes. Tools such as Special Pool make our debugging lives easier; however, tracking down corruption doesn’t always have to make you pull your hair out. In some cases, simply retracing the steps that led up to the crash can reveal a smoking gun.
Let’s take a look at a real-world example. First we need to be in the right context, so we use the .trap command to restore the register context from the moment the machine crashed.
2: kd> .trap 0xfffffffff470662c
ErrCode = 00000002
eax=35303132 ebx=fd24d640 ecx=fd24d78c edx=fd24d784 esi=fd24d598 edi=fd24d610
eip=e083f7a5 esp=f47066a0 ebp=f47066e0 iopl=0 nv up ei pl nz na po nc
cs=0008 ss=0010 ds=0023 es=0023 fs=0030 gs=0000 efl=00010202
nt!KeWaitForSingleObject+0x25b:
e083f7a5 ff4818 dec dword ptr [eax+18h] ds:0023:3530314a=????????
From the register output we can tell that the system crashed while attempting to decrement the dword at memory location [eax+18h]. The value stored in register eax is probably the address of a structure, given that the code is accessing offset 18h from it. Currently eax holds 0x35303132, which is clearly not a valid kernel mode address. On 32-bit systems, kernel mode addresses are at or above 0x80000000, assuming the machine is not using something like the /3GB switch. Our mission now is to determine how eax was set.
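The address check described above can be sketched in a few lines of Python. This is only an illustration of the reasoning, not something the debugger runs; the helper name is_kernel_address and the KERNEL_BASE constant are my own, and the 0x80000000 boundary assumes the default 2 GB/2 GB split without /3GB.

```python
# Sketch: sanity-check the pointer from the trap frame.
# KERNEL_BASE assumes the default 32-bit split (no /3GB switch).
KERNEL_BASE = 0x80000000


def is_kernel_address(addr: int, kernel_base: int = KERNEL_BASE) -> bool:
    """Return True if addr lies in the kernel half of a 32-bit address space."""
    return kernel_base <= addr <= 0xFFFFFFFF


eax = 0x35303132                             # from the .trap output
fault_address = (eax + 0x18) & 0xFFFFFFFF    # dec dword ptr [eax+18h]

print(f"fault address: {fault_address:08x}")
print("eax is a kernel address:", is_kernel_address(eax))
print("ebx is a kernel address:", is_kernel_address(0xFD24D640))
```

The computed fault address matches the ds:0023:3530314a shown on the faulting instruction line, and eax fails the check while a neighboring register such as ebx passes it.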
First we’ll unassemble the failing function using the UF command.
2: kd> uf nt!KeWaitForSingleObject
…..
……
…..
nt!KeWaitForSingleObject+0x25b:
e083f7a5 ff4818 dec dword ptr [eax+18h]
e083f7a8 8b4818 mov ecx,dword ptr [eax+18h]
e083f7ab 3b481c cmp ecx,dword ptr [eax+1Ch]
e083f7ae 0f836ef9ffff jae nt!KeWaitForSingleObject+0x2a3 (e083f122)
I truncated the results of the UF output to conserve space in this blog. The instruction at e083f7a5 is the one that generated the fault, so our focus is to determine how eax was set before that instruction ran. Based on the UF output, the instruction at e083f11c could have jumped to e083f7a5. Let’s investigate how eax is set before that jump is taken.
nt!KeWaitForSingleObject+0x244:
e083f107 8d4208 lea eax,[edx+8]
e083f10a 8b4804 mov ecx,dword ptr [eax+4]
e083f10d 8903 mov dword ptr [ebx],eax
e083f10f 894b04 mov dword ptr [ebx+4],ecx
e083f112 8919 mov dword ptr [ecx],ebx
e083f114 895804 mov dword ptr [eax+4],ebx
e083f117 8b4668 mov eax,dword ptr [esi+68h]
e083f11a 85c0 test eax,eax
e083f11c 0f8583060000 jne nt!KeWaitForSingleObject+0x25b (e083f7a5) <-- Jump
The instruction at e083f117 loads eax from [esi+68h], so I’m dumping that memory location here.
2: kd> dd esi+68h l1
fd24d600 35303132
Bingo! There’s our bad value of 35303132, which is also the value of the eax register, so we probably took this code path. Just to confirm the current value of eax, I’m dumping the register, which should match what the “r” command shows for eax in the full register set.
2: kd> r eax
Last set context:
eax=35303132
Now our focus moves to why dword ptr [esi+68h] holds the bad value. Without source code this can be challenging to narrow down; however, the !pool command comes in handy for cases like this.
2: kd> ? esi+68h
Evaluate expression: -47917568 = fd24d600
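The ? command prints the result both as a signed decimal and as hex, which is why a kernel address shows up as a negative number. A small Python sketch of the same arithmetic, using the esi value from the trap frame (the helper to_signed32 is just an illustrative name):

```python
def to_signed32(value: int) -> int:
    """Interpret a 32-bit value as a signed two's-complement integer."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


esi = 0xFD24D598                       # from the .trap output
address = (esi + 0x68) & 0xFFFFFFFF    # the location read by mov eax,[esi+68h]

print(f"Evaluate expression: {to_signed32(address)} = {address:08x}")
```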
Let’s examine fd24d600 in a little more detail using the !pool command. The !pool command neatly displays an entire 4 KB page of kernel memory, listing all of the allocations contained on the page. From the output we can determine that our address is allocated from nonpaged pool and holds some sort of thread data, evidenced by the Thre tag next to our allocation. Notice the asterisk next to fd24d578 indicating the start of our pool allocation. Virtual address fd24d578 is the beginning of an 8-byte pool header, and the header is followed by the actual data blob. Be aware that not all memory is allocated from the pool, so the !pool command is not always useful. I have more information on !pool later in the blog.
2: kd> !pool fd24d600
Pool page fd24d600 region is Nonpaged pool
fd24d000 size: 270 previous size: 0 (Allocated) Thre (Protected)
fd24d270 size: 10 previous size: 270 (Free) `.lk
fd24d280 size: 40 previous size: 10 (Allocated) Ntfr
fd24d2c0 size: 20 previous size: 40 (Free) CcSc
fd24d2e0 size: 128 previous size: 20 (Allocated) PTrk
fd24d408 size: 128 previous size: 128 (Allocated) PTrk
fd24d530 size: 8 previous size: 128 (Free) Mdl
fd24d538 size: 28 previous size: 8 (Allocated) Ntfn
fd24d560 size: 18 previous size: 28 (Free) Muta
*fd24d578 size: 270 previous size: 18 (Allocated) *Thre (Protected) <-- our pool
fd24d7e8 size: 428 previous size: 270 (Allocated) Mdl
fd24dc10 size: 30 previous size: 428 (Allocated) Even (Protected)
fd24dc40 size: 30 previous size: 30 (Allocated) TCPc
fd24dc70 size: 18 previous size: 30 (Free) SeTd
fd24dc88 size: 28 previous size: 18 (Allocated) Ntfn
fd24dcb0 size: 128 previous size: 28 (Allocated) PTrk
fd24ddd8 size: 228 previous size: 128 (Allocated) tdLL
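To make the layout concrete, here is a rough Python sketch of what !pool is doing for us: each entry's start plus its size gives the start of the next entry, and the marked entry is the one whose range contains our address. The entries are transcribed from the output above; the PoolEntry type and helper names are hypothetical, and the 8-byte header size is the one mentioned earlier for this 32-bit system.

```python
from typing import List, NamedTuple, Optional


class PoolEntry(NamedTuple):
    start: int   # address of the pool header
    size: int    # block size in bytes, header included
    state: str
    tag: str


POOL_HEADER_SIZE = 8  # 32-bit pool header, as described above

# Transcribed from the !pool fd24d600 output.
ENTRIES = [
    PoolEntry(0xFD24D000, 0x270, "Allocated", "Thre"),
    PoolEntry(0xFD24D270, 0x010, "Free", "`.lk"),
    PoolEntry(0xFD24D280, 0x040, "Allocated", "Ntfr"),
    PoolEntry(0xFD24D2C0, 0x020, "Free", "CcSc"),
    PoolEntry(0xFD24D2E0, 0x128, "Allocated", "PTrk"),
    PoolEntry(0xFD24D408, 0x128, "Allocated", "PTrk"),
    PoolEntry(0xFD24D530, 0x008, "Free", "Mdl"),
    PoolEntry(0xFD24D538, 0x028, "Allocated", "Ntfn"),
    PoolEntry(0xFD24D560, 0x018, "Free", "Muta"),
    PoolEntry(0xFD24D578, 0x270, "Allocated", "Thre"),
    PoolEntry(0xFD24D7E8, 0x428, "Allocated", "Mdl"),
    PoolEntry(0xFD24DC10, 0x030, "Allocated", "Even"),
    PoolEntry(0xFD24DC40, 0x030, "Allocated", "TCPc"),
    PoolEntry(0xFD24DC70, 0x018, "Free", "SeTd"),
    PoolEntry(0xFD24DC88, 0x028, "Allocated", "Ntfn"),
    PoolEntry(0xFD24DCB0, 0x128, "Allocated", "PTrk"),
    PoolEntry(0xFD24DDD8, 0x228, "Allocated", "tdLL"),
]


def is_contiguous(entries: List[PoolEntry]) -> bool:
    """Each block should end exactly where the next one begins."""
    return all(a.start + a.size == b.start for a, b in zip(entries, entries[1:]))


def find_entry(entries: List[PoolEntry], address: int) -> Optional[PoolEntry]:
    """Return the block whose address range contains the given address."""
    for entry in entries:
        if entry.start <= address < entry.start + entry.size:
            return entry
    return None


target = 0xFD24D600
entry = find_entry(ENTRIES, target)
offset_in_data = target - (entry.start + POOL_HEADER_SIZE)

print("page is contiguous:", is_contiguous(ENTRIES))
print(f"{target:08x} is in block {entry.start:08x} ({entry.tag}), "
      f"{offset_in_data:#x} bytes past the header")
```

The contiguity check is a handy habit: when a pool header itself has been overwritten, the sizes stop adding up and !pool reports an error instead of a clean listing.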
I’ll dump out the contents of the allocation using the dc command, starting at the pool header for this block of memory. Remember, the code moves a value from [esi+68h] into eax and later dereferences [eax+18h], which leads me to believe that eax is meant to hold the base address of a structure. So we expect a valid kernel mode address to be moved into eax rather than something like a string; otherwise the code wouldn’t dereference an offset from it.
2: kd> dc fd24d578
fd24d578 0a4e0003 e5726854 00000003 00000002 ..N.Thr.........
fd24d588 eb10ee70 20000000 e08b5c60 eb136f96 p...... `\...o..
fd24d598 006e0006 00000000 fd24d5a0 fd24d5a0 ..n.......$...$.
fd24d5a8 fd24d5a8 fd24d5a8 f4707000 f4704000 ..$...$..pp..@p.
fd24d5b8 f4706d48 00000000 fd24d700 fd24d700 Hmp.......$...$.
fd24d5c8 fd24d5c8 fd24d5c8 fd270290 01000100 ..$...$...'.....
fd24d5d8 00000002 00000000 00000001 01000a02 ................
fd24d5e8 00000000 fd24d640 32110000 0200009f ....@.$....2....
2: kd> dc
fd24d5f8 00000000 20202020 32313532 000a6953 .... 25125Si.. <-- appears to be a string.
fd24d608 20202020 20202020 20202020 5c4e4556 VEN\
fd24d618 32313532 20202020 20202020 20202020 2512
fd24d628 00000000 00000000 00000000 00000000 ................
fd24d638 00000000 00000000 fd24d78c fd24d78c ..........$...$.
fd24d648 00000000 fd24d784 fd24d640 30010000 ......$.@.$....0
fd24d658 00343033 00000000 00000000 00000000 304.............
fd24d668 00000000 01000000 00000000 00000000 ................
2: kd> dc
fd24d678 fd24d598 00000000 00000000 00000000 ..$.............
fd24d688 fd24d618 fd24d618 fd24d598 fd24d610 ..$...$...$...$.
fd24d698 00000000 00010102 00000000 00000000 ................
fd24d6a8 00000000 00000000 e08aeee0 00000000 ................
fd24d6b8 00000801 0000000f fd270290 0000000f ..........'.....
fd24d6c8 fd24d5c0 fd24d6d0 00000000 00000000 ..$...$.........
fd24d6d8 00000000 00000000 00000000 00000000 ................
fd24d6e8 00000000 00000000 f4707000 06300612 .........pp...0.
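As a side note, the first two dwords of the dump (0a4e0003 e5726854) are the 8-byte pool header, and they can be decoded by hand. The sketch below assumes the 32-bit pool header layout of this era (two 16-bit words, each split into a 9-bit size field and a 7-bit field, followed by a 4-byte tag), 8-byte size units, and the high bit of the tag marking a protected allocation. Treat those details as my assumptions; the function name is illustrative only.

```python
def decode_pool_header(dword0: int, tag_dword: int) -> dict:
    """Decode an x86 pool header (assumed layout; sizes are in 8-byte units)."""
    low, high = dword0 & 0xFFFF, (dword0 >> 16) & 0xFFFF
    protected = bool(tag_dword & 0x80000000)       # assumed "protected" bit
    tag_bytes = (tag_dword & 0x7FFFFFFF).to_bytes(4, "little")
    return {
        "previous_size": (low & 0x1FF) * 8,
        "pool_index": low >> 9,
        "block_size": (high & 0x1FF) * 8,
        "pool_type": high >> 9,
        "tag": tag_bytes.decode("ascii", errors="replace"),
        "protected": protected,
    }


header = decode_pool_header(0x0A4E0003, 0xE5726854)
print(f"size: {header['block_size']:x} previous size: {header['previous_size']:x} "
      f"tag: {header['tag']} protected: {header['protected']}")
```

Under those assumptions the decoded sizes and tag agree with the line that !pool printed for fd24d578, which suggests the header itself is intact and the damage is confined to the body of the allocation.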
Examining the memory contents above, you can clearly see a string overwrite starting around 0xfd24d5f8. The memory we dereferenced, fd24d600 or [esi+68h], is right in the middle of the string. The string appears to be a vendor number for a piece of hardware. After examining the setupapi.log and the OEM**.inf files in the Windows\inf directory, we found a similar string and narrowed it down to a third party.
One more point about the !pool command is worth mentioning. The memory address of interest may not always be allocated from the pool, in which case you would encounter a message similar to this.
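If you want to check a suspicious value without eyeballing the dc output, the same little-endian decoding is easy to sketch in Python. The dwords below are copied from the row at fd24d608 in the dump; the helper names are hypothetical, and the "all four bytes printable" test is only a rough heuristic, not a rule the debugger applies.

```python
import struct
from typing import Iterable


def dwords_to_ascii(dwords: Iterable[int]) -> str:
    """Render dwords the way dc does: little-endian bytes, '.' if unprintable."""
    values = list(dwords)
    raw = struct.pack(f"<{len(values)}I", *values)
    return "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in raw)


def looks_like_text(dword: int) -> bool:
    """Heuristic: every byte of the dword is a printable ASCII character."""
    return all(0x20 <= b < 0x7F for b in dword.to_bytes(4, "little"))


row_fd24d608 = [0x20202020, 0x20202020, 0x20202020, 0x5C4E4556]
print(repr(dwords_to_ascii(row_fd24d608)))

for name, value in (("eax", 0x35303132), ("ebx", 0xFD24D640)):
    print(f"{name}={value:08x} looks like text: {looks_like_text(value)}")
```

A pointer-sized value whose bytes are all printable characters, like the one in eax here, is a strong hint that a string was written over a field that should have held an address.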
0: kd> !pool 80000ae5
Pool page 80000ae5 region is Unknown
80000000 is not a valid large pool allocation, checking large session pool...
80000000 is freed (or corrupt) pool
Bad allocation size @80000000, too large
***
*** An error (or corruption) in the pool was detected;
*** Pool Region unknown (0xFFFFFFFF80000000)
***
*** Use !poolval 80000000 for more details.
***
If this had been the case, I would have enabled Special Pool to narrow down the culprit.
Comments
Anonymous
May 14, 2008
Great Post. Great job explaining the concepts behind the analysis. Thanks.
Anonymous
May 14, 2008
Pretty cool. I guess you're lucky that the driver was leaving its calling card on the spot where it's corrupting memory. That's pretty rare.
Anonymous
May 18, 2008
Awesome Blog Post! Keep up the great work!