When Special Pool is not so Special
Hi everyone. Richard here from the UK GES team, bringing you an interesting case we saw recently where Special Pool gave us some unexpected results, but ultimately still helped us track down the cause of the problem.
It started when we were looking into some blue screen bugchecks on Windows 7 client builds for a customer. We saw several different bugcheck codes (0xA, 0xD1 and 0x19 being the most common) and everything pointed towards pool corruption.
We enabled Special Pool and Pool Tracking using Driver Verifier (flags 0x9 = 0x1 Special Pool + 0x8 Pool Tracking) with the following parameters...
verifier /flags 0x9 /all
...and sat back, made a cup of tea, and waited for the new dumps to arrive.
When they did, it looked very promising. All were bugcheck 0xC1 similar to the following...
0: kd> .bugcheck
Bugcheck code 000000C1
Arguments ca0b4fe0 00000521 00000020 00000021
SPECIAL_POOL_DETECTED_MEMORY_CORRUPTION (c1)
Special pool has detected memory corruption. Typically the current thread's
stack backtrace will reveal the guilty party.
Arguments:
Arg1: ca0b4fe0, Address trying to free.
Arg2: 00000521, Size of the memory block, as recorded in the pool block header.
Arg3: 00000020, Size of the memory block, as computed based on the address being freed.
Arg4: 00000021, Caller is trying to free an incorrect Special Pool memory block.
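As an aside, the arithmetic behind Arg3 is worth a moment. Assuming the allocation is placed at the end of the page (so overruns hit the guard page that follows), the "size as computed based on the address being freed" is just the distance from that address to the end of its page. The sketch below is our interpretation, not the kernel's actual code, but the numbers line up with the bugcheck arguments:

```python
PAGE_SIZE = 0x1000

addr = 0xCA0B4FE0                        # Arg1: the address being freed
computed = PAGE_SIZE - (addr & 0xFFF)    # distance from the address to the guard page
print(hex(computed))                     # -> 0x20, matching Arg3

recorded = 0x521                         # Arg2: size read from the (overwritten) header
assert computed != recorded              # this mismatch is what raises bugcheck 0xC1
```

The recorded size of 0x521 comes from a pool header that, as we shall see, had been trampled, which is why it disagrees with the size implied by the address.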
It looked as if Verifier had done its magic and we were just a stack backtrace away from solving the issue. However, when we looked, the stacks were all different, such as the two examples below...
1: kd> kL
# ChildEBP RetAddr
00 de0c5438 82eef4a3 nt!KeBugCheckEx+0x1e
01 de0c545c 82eeff89 nt!MiCheckSpecialPoolSlop+0x6e
02 de0c553c 82f29b90 nt!MmFreeSpecialPool+0x15b
03 de0c55a4 8313df90 nt!ExFreePoolWithTag+0xd6
04 de0c55b8 8e556044 nt!VerifierExFreePoolWithTag+0x30
05 de0c55c8 8e561814 dxgmms1!operator delete[]+0x16
06 de0c5608 8e56203d dxgmms1!VIDMM_GLOBAL::CloseOneAllocation+0x1a0
07 de0c5628 8e54532b dxgmms1!VIDMM_GLOBAL::CloseAllocation+0x37
08 de0c5638 9736b4e5 dxgmms1!VidMmCloseAllocation+0x13
09 de0c56b4 9736a9ca dxgkrnl!DXGDEVICE::DestroyAllocations+0x19a
0a de0c56d4 97361ab9 dxgkrnl!DXGDEVICE::DestroyResource+0x4d
0b de0c56f4 97364ede dxgkrnl!DXGDEVICE::ProcessTerminationList+0x7a
0c de0c5904 9736761e dxgkrnl!DXGCONTEXT::Present+0x1c3
0d de0c5c18 9f01c011 dxgkrnl!DxgkPresent+0x2dd
0e de0c5c28 82e45d86 win32k!NtGdiDdDDIPresent+0x19
0f de0c5c28 77d16b94 nt!KiSystemServicePostCall
WARNING: Frame IP not in any known module. Following frames may be wrong.
10 001fe3f0 00000000 0x77d16b94
0: kd> kL
# ChildEBP RetAddr
00 b5a078dc 82f25e99 nt!KeBugCheckEx+0x1e
01 b5a079cc 82f5fb90 nt!MmFreeSpecialPool+0x6c
02 b5a07a34 83173f90 nt!ExFreePoolWithTag+0xd6
03 b5a07a48 8dd7754b nt!VerifierExFreePoolWithTag+0x30
04 b5a07a64 8dd7a54d Npfs!NpRemoveDataQueueEntry+0x81
05 b5a07a84 8dd7623f Npfs!NpSetClosingPipeState+0x11f
06 b5a07ab8 8dd762e3 Npfs!NpCommonCleanup+0xf5
07 b5a07ad0 831724d9 Npfs!NpFsdCleanup+0x19
08 b5a07af4 82e75072 nt!IovCallDriver+0x73
09 b5a07b08 83071fd7 nt!IofCallDriver+0x1b
0a b5a07b48 8306344d nt!IopCloseFile+0x2f3
0b b5a07b94 83084a95 nt!ObpDecrementHandleCount+0x139
0c b5a07bdc 830847d5 nt!ObpCloseHandleTableEntry+0x203
0d b5a07c0c 83084b6f nt!ObpCloseHandle+0x7f
0e b5a07c28 82e7bd86 nt!NtClose+0x4e
0f b5a07c28 77736b94 nt!KiSystemServicePostCall
WARNING: Frame IP not in any known module. Following frames may be wrong.
10 00defd44 00000000 0x77736b94
Usually we expect to see the guilty driver in the stack, but in this case we saw several different drivers across the stacks, each bugchecking only when it tried to free its allocation.
If we dump back a little from the first bugcheck parameter (the address being freed) we can see the slop pattern, the allocation (of size 0x20) and the following guard page…
0: kd> dc ca0b4fe0-30
ca0b4fb0 53535353 53535353 53535353 53535353 SSSSSSSSSSSSSSSS
ca0b4fc0 53535353 53535353 53535353 53535353 SSSSSSSSSSSSSSSS
ca0b4fd0 53535353 53535353 53535353 53535353 SSSSSSSSSSSSSSSS
ca0b4fe0 cb55f91c cb55f91c b5e70c50 00000000 ..U...U.P.......
ca0b4ff0 00000000 00000400 00000400 53535353 ............SSSS
ca0b5000 ???????? ???????? ???????? ???????? ????????????????
ca0b5010 ???????? ???????? ???????? ???????? ????????????????
ca0b5020 ???????? ???????? ???????? ???????? ????????????????
The start of this page should show a 12-byte header, followed by the start of the slop bytes, such as this example taken from an intact page in the same dump…
0: kd> dc ca0b2000
ca0b2000 001d4064 6c734d46 c7a66250 1d1d1d1d d@..FMslPb......
ca0b2010 1d1d1d1d 1d1d1d1d 1d1d1d1d 1d1d1d1d ................
ca0b2020 1d1d1d1d 1d1d1d1d 1d1d1d1d 1d1d1d1d ................
ca0b2030 1d1d1d1d 1d1d1d1d 1d1d1d1d 1d1d1d1d ................
…but what we see at the start of the damaged pages is…
0: kd> dc ca0b4000
ca0b4000 827f0521 40a47703 d610c641 01800011 !....w.@A.......
ca0b4010 9073f69c 900f7422 48d9077a 2667a442 ..s."t..z..HB.g&
ca0b4020 001c955c 42893020 8c3cc405 02051bb8 \... 0.B..<.....
ca0b4030 c13675c1 52055328 73985e4a 041877a1 .u6.(S.RJ^.s.w..
…garbage.
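The check that fired here is easy to model. The sketch below is a simplified illustration only, not the real MmFreeSpecialPool logic (the actual header layout is internal and undocumented); the constants simply mirror what we observed in these dumps, with the allocation placed at the end of the page so that overruns hit the guard page and underruns land in the slop bytes:

```python
PAGE_SIZE = 0x1000
HEADER_SIZE = 12          # the 12-byte header observed on an intact page
SLOP_BYTE = 0x53          # 'S', the fill pattern seen in these dumps

def build_page(alloc_size):
    """Model a special-pool page: header, slop fill, allocation at page end."""
    page = bytearray(PAGE_SIZE)
    alloc_offset = PAGE_SIZE - alloc_size       # allocation ends at the guard page
    for i in range(HEADER_SIZE, alloc_offset):
        page[i] = SLOP_BYTE                     # slop fills the unused space
    return page, alloc_offset

def check_slop(page, alloc_offset):
    """Rough analogue of the check done at free time: every byte between
    the header and the allocation must still hold the slop pattern."""
    return all(b == SLOP_BYTE for b in page[HEADER_SIZE:alloc_offset])

page, off = build_page(0x20)
assert check_slop(page, off)        # an intact page passes

page[0x40] = 0xCC                   # a wild write lands in the slop area...
assert not check_slop(page, off)    # ...and is caught when the block is freed
```

This also explains why the bugcheck fingers the wrong drivers: the wild write happens silently at some earlier point, and the corruption is only detected later, when the innocent owner of the page frees its allocation.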
However, all is not lost. Further inspection shows that across all the 0xC1 dumps this “garbage” has a semi-predictable pattern…
0: kd> dc ca0b4000 ca0b4fff
ca0b4000 827f0521 40a47703 d610c641 01800011 !....w.@A.......
ca0b4010 9073f69c 900f7422 48d9077a 2667a442 ..s."t..z..HB.g&
ca0b4020 001c955c 42893020 8c3cc405 02051bb8 \... 0.B..<.....
ca0b4030 c13675c1 52055328 73985e4a 041877a1 .u6.(S.RJ^.s.w..
ca0b4040 665f093d e9bcfca4 efb0dded d65bf575 =._f........u.[.
ca0b4050 10e40107 441909c4 a058556d b1844702 .......DmUX..G..
ca0b4060 0ed49060 641b130c 86ba2210 82604002 `......d."...@`.
============================SNIP============================
ca0b48d0 70006000 00000000 00000000 00000000 .`.p............
ca0b48e0 74006000 00000000 00000000 00000000 .`.t............
ca0b48f0 78006000 00000000 00000000 00000000 .`.x............
ca0b4900 7c006000 00000000 00000000 00000000 .`.|............
ca0b4910 80006000 00000000 00000000 00000000 .`..............
ca0b4920 78090043 02000000 22220000 02400000 C..x......""..@.
ca0b4930 11130000 02400000 11130000 02c0000c ......@.........
ca0b4940 11110000 00000000 00000000 00000000 ................
ca0b4950 00000000 00000000 00000000 00000000 ................
============================SNIP============================
ca0b4a10 00000000 00000000 00000000 00000000 ................
ca0b4a20 00000000 00000000 00000000 00000000 ................
ca0b4a30 00000000 680b0000 00000000 00000000 .......h........
ca0b4a40 53535353 53535353 53535353 53535353 SSSSSSSSSSSSSSSS
ca0b4a50 53535353 53535353 53535353 53535353 SSSSSSSSSSSSSSSS
ca0b4a60 53535353 53535353 53535353 53535353 SSSSSSSSSSSSSSSS
ca0b4a70 53535353 53535353 53535353 53535353 SSSSSSSSSSSSSSSS
ca0b4a80 53535353 53535353 53535353 53535353 SSSSSSSSSSSSSSSS
============================SNIP============================
ca0b4fb0 53535353 53535353 53535353 53535353 SSSSSSSSSSSSSSSS
ca0b4fc0 53535353 53535353 53535353 53535353 SSSSSSSSSSSSSSSS
ca0b4fd0 53535353 53535353 53535353 53535353 SSSSSSSSSSSSSSSS
ca0b4fe0 cb55f91c cb55f91c b5e70c50 00000000 ..U...U.P.......
ca0b4ff0 00000000 00000400 00000400 53535353 ............SSSS
Most of the garbage data towards the start of the page varies between corruptions, but some parts, such as the DWORD 78090043 at ca0b4920 above, were consistent across all dumps. So we chose this DWORD to search for in kernel memory (remembering to set .ignore_missing_pages 1 first to keep screen-chaff to a minimum)…
0: kd> .ignore_missing_pages 1;s -d 80000000 L?7fffffff 78090043
Suppress kernel summary dump missing page error message
8ade97f8 78090043 02000000 22220000 02400000 C..x......""..@.
96f94038 78090043 00000000 20000000 1b000001 C..x....... ....
ca0b4920 78090043 02000000 22220000 02400000 C..x......""..@.
There were always three hits in every dump: one for the corrupted pool page itself; one for the crash dump driver, caught in the act of writing that page out; and a third pointing into a third-party driver…
0: kd> lmta 96f94038
start end module name
96616000 97332000 DRIVER_X Tue Mar 27 03:05:05 2012 (4F712051)
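Outside the debugger, the kind of search `s -d` performs is simple to sketch. `search_dword` below is a hypothetical helper, a loose analogue of the debugger command rather than how it is actually implemented:

```python
import struct

def search_dword(buf, value, base=0):
    """Return the addresses at which the little-endian DWORD `value`
    appears in `buf`, with `base` added to each offset found."""
    needle = struct.pack("<I", value)
    hits, pos = [], buf.find(needle)
    while pos != -1:
        hits.append(base + pos)
        pos = buf.find(needle, pos + 1)
    return hits

# A fake slice of "kernel memory" with the pattern planted at two offsets.
mem = bytearray(0x100)
mem[0x38:0x3C] = struct.pack("<I", 0x78090043)
mem[0xA0:0xA4] = struct.pack("<I", 0x78090043)

print([hex(a) for a in search_dword(bytes(mem), 0x78090043, base=0x96F94000)])
# -> ['0x96f94038', '0x96f940a0']
```

A hit inside a module's image, as in the `lmt a` output above, suggests the pattern is part of that driver's code or static data, which is exactly the clue we needed.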
Since these crashes were on 32-bit systems we could use the above search to check the entire range of kernel memory for the pattern. However, since 64-bit systems are more common these days, even on client systems, we could instead have narrowed this down to searching just the address space of the loaded modules...
0: kd> !for_each_module s -d @#Base @#End 78090043
96f94038 78090043 00000000 20000000 1b000001 C..x....... ....
Whichever method you prefer, an update to the third-party DRIVER_X.SYS resolved the issue, and there was much rejoicing.