x64 Manual Stack Reconstruction and Stack Walking

My name is Trey Nash and I am an Escalation Engineer on the Core OS team. My experience is as a software developer, and therefore my blog posts tend to be slanted in the direction of helping developers during the feature development, testing and the support phases.

In this installment I would like to expand a bit on a previous post of mine called Challenges of Debugging Optimized x64 Code. In that post I discussed the nuances of the x64 calling convention (of which, thankfully, there is only one) and how it is used in optimized builds of software. The calling convention is sometimes referred to as the Application Binary Interface (ABI). In this post, I would like to discuss the x64 unwind metadata and how you can use it in the debugger to manually walk a stack.

In some cases, you may have a corrupted stack that the debugger simply cannot walk for you. This often happens because the debugger walks a stack from the top down (picture the stack growing upward, like a stack of plates on a table), and if the stack is sufficiently trashed then the debugger cannot find its bearings. In the x86 world, a large percentage of the time, you can spot the stack frames by following the chain of base pointers and then build a crafty stack backtrace command to display the stack at some point in time. But the x64 calling convention does not rely on a base pointer. In fact, once a function’s prolog code has executed, the rsp register generally does not change until the epilog code. To read more about x64 prolog and epilog code conventions, go here.

Moreover, the syntax for creating a crafty stack backtrace command in the x64 environment is currently undocumented, and I aim to shed some light on that near the end of this blog post. :)

The Example Code

For this blog post I have used the following example C# code that requires the .NET 4.0 framework and can be easily built from a Visual Studio 2010 command prompt. You can find the code below:

using System;
using System.Numerics;
using System.Threading;
using System.Threading.Tasks;
using System.Collections.Concurrent;

class EntryPoint
{
    const int FactorialsToCompute = 2000;

    static void Main() {
        var numbers = new ConcurrentDictionary<BigInteger, BigInteger>(4, FactorialsToCompute);

        // Create a factorial delegate.
        Func<BigInteger, BigInteger> factorial = null;
        factorial = (n) => ( n == 0 ) ? 1 : n * factorial(n-1);

        // Now compute the factorial of the list
        // concurrently.
        Parallel.For( 0,
                      FactorialsToCompute,
                      (i) => {
                          numbers[i] = factorial(i);
                      } );
    }
}

The spirit of this code is to concurrently compute the first 2000 factorials and store the results in a dictionary. This code uses the new Task Parallel Library to distribute this work evenly across the multiple cores on the system. To compile the example (assuming the code is stored in test.cs), you can execute the following command from a Visual Studio 2010 command prompt:

csc /r:system.numerics.dll test.cs

Note: If you are using a 64-bit platform, be sure to use the x64 command prompt shortcut installed by the Visual Studio 2010 installer.

You can download a free evaluation of Visual Studio 2010 here.

x64 Unwind Metadata

So how do the debugger and functions such as RtlVirtualUnwind know how to walk the x64 stack if they cannot find a base pointer? The secret is that they use unwind metadata that is typically baked into the Portable Executable (PE) file at link time. You can inspect this information using the /UNWINDINFO option of the command line tool dumpbin. For example, I went to the directory on my machine which contains the 64-bit clr.dll (c:\Windows\Microsoft.NET\Framework64\v4.0.30319) and dumped the unwind info looking for CLREvent::WaitEx, which I have pasted below:

  00013F20 000DFDB0 000DFE3C 007267D8 ?WaitEx@CLREvent@@QEAAKKW4WaitMode@@PEAUPendingSync@@@Z (public: unsigned long __cdecl CLREvent::WaitEx(unsigned long,enum WaitMode,struct PendingSync *))
Unwind version: 1
Unwind flags: UHANDLER
Size of prologue: 0x20
Count of codes: 10
Unwind codes:
20: SAVE_NONVOL, register=rbp offset=0xB0
1C: SAVE_NONVOL, register=rbx offset=0xA8
0F: ALLOC_SMALL, size=0x70
0B: PUSH_NONVOL, register=r14
09: PUSH_NONVOL, register=r13
07: PUSH_NONVOL, register=r12
05: PUSH_NONVOL, register=rdi
04: PUSH_NONVOL, register=rsi
Handler: 0020ADF0 __CxxFrameHandler3
EH Handler Data: 007B3F54

I’ll get into what all of this means shortly.

Note: The dumpbin.exe functionality is also exposed via the linker. For example, the command “dumpbin.exe /?” is identical to “link.exe /dump /?”.
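To make the idea concrete, here is a minimal sketch of how an instruction pointer can be mapped to its unwind info. It is written in Python purely for illustration and is not what the debugger or RtlVirtualUnwind actually runs. It assumes that the three RVAs following the first column of the dumpbin line above are the function's begin address, end address (exclusive) and unwind info address. The two neighbouring table entries and the name lookup_function_entry are hypothetical.

```python
from bisect import bisect_right
from collections import namedtuple

RuntimeFunction = namedtuple("RuntimeFunction", "begin end unwind_info")

# Function table sorted by begin RVA. Only the middle entry comes from the
# dumpbin output; the two neighbours are hypothetical placeholders.
FUNCTION_TABLE = [
    RuntimeFunction(0x000DFD00, 0x000DFDA8, 0x00726700),  # hypothetical
    RuntimeFunction(0x000DFDB0, 0x000DFE3C, 0x007267D8),  # CLREvent::WaitEx
    RuntimeFunction(0x000DFE3C, 0x000DFE90, 0x00726800),  # hypothetical
]

def lookup_function_entry(table, rva):
    """Binary-search the sorted table for the entry whose range covers rva."""
    begins = [entry.begin for entry in table]
    index = bisect_right(begins, rva) - 1
    if index >= 0 and table[index].begin <= rva < table[index].end:
        return table[index]
    return None  # no unwind info registered for this address

# An address 0x10 bytes into CLREvent::WaitEx resolves to its unwind info RVA.
entry = lookup_function_entry(FUNCTION_TABLE, 0x000DFDB0 + 0x10)
print(hex(entry.unwind_info))
```

The point of the sketch is only that the lookup is keyed by instruction pointer, which is why a valid rip matters so much later in this post.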

 

Within the debugger, you can find this same information for a particular function using the .fnent command. To demonstrate, I ran the example code under windbg, broke in at a random point, and picked one of the threads, whose stack looks like the following:

  12 Id: f80.7f0 Suspend: 1 Teb: 000007ff`fffa0000 Unfrozen
# Child-SP RetAddr Call Site
00 00000000`04a51e18 000007fe`fd4e10ac ntdll!NtWaitForSingleObject+0xa
01 00000000`04a51e20 000007fe`f48bffc7 KERNELBASE!WaitForSingleObjectEx+0x79
02 00000000`04a51ec0 000007fe`f48bff70 clr!CLREvent::WaitEx+0x170
03 00000000`04a51f00 000007fe`f48bfe23 clr!CLREvent::WaitEx+0xf8
04 00000000`04a51f60 000007fe`f48d51d8 clr!CLREvent::WaitEx+0x5e
05 00000000`04a52000 000007fe`f4995249 clr!SVR::gc_heap::wait_for_gc_done+0x98
06 00000000`04a52030 000007fe`f48aef28 clr!SVR::GCHeap::Alloc+0xb4
07 00000000`04a520a0 000007fe`f48aecc9 clr!FastAllocatePrimitiveArray+0xc5
08 00000000`04a52120 000007fe`f071244c clr!JIT_NewArr1+0x389
09 00000000`04a522f0 000007fe`f07111b5 System_Numerics_ni+0x2244c
0a 00000000`04a52330 000007ff`00150acf System_Numerics_ni+0x211b5
0b 00000000`04a523d0 000007ff`0015098c 0x7ff`00150acf
0c 00000000`04a52580 000007ff`0015098c 0x7ff`0015098c
0d 00000000`04a52730 000007ff`0015098c 0x7ff`0015098c
0e 00000000`04a528e0 000007ff`0015098c 0x7ff`0015098c
0f 00000000`04a52a90 000007ff`0015098c 0x7ff`0015098c
10 00000000`04a52c40 000007ff`0015098c 0x7ff`0015098c
11 00000000`04a52df0 000007ff`0015098c 0x7ff`0015098c
12 00000000`04a52fa0 000007ff`0015098c 0x7ff`0015098c
13 00000000`04a53150 000007ff`0015098c 0x7ff`0015098c

At first glance, it may appear that this stack is already trashed since there is no symbol information for the bottom frames in the display. Before jumping to this conclusion, recall that this is a managed application and therefore contains JIT compiled code. To verify that the addresses without symbol information are JIT’ed code, you can do a couple of things.

First, use the !EEHeap command in the SOS extension to determine whether these addresses reside in the JIT code heap. Below, you can see the commands I used to load the SOS extension and then display the Execution Engine (EE) heap information:

0:014> .loadby sos clr

0:014> !EEHeap -loader
Loader Heap:
--------------------------------------
System Domain: 000007fef50955a0
LowFrequencyHeap: 000007ff00020000(2000:1000) Size: 0x1000 (4096) bytes.
HighFrequencyHeap: 000007ff00022000(8000:1000) Size: 0x1000 (4096) bytes.
StubHeap: 000007ff0002a000(2000:2000) Size: 0x2000 (8192) bytes.
Virtual Call Stub Heap:
IndcellHeap: 000007ff000d0000(6000:1000) Size: 0x1000 (4096) bytes.
LookupHeap: 000007ff000dc000(4000:1000) Size: 0x1000 (4096) bytes.
ResolveHeap: 000007ff00106000(3a000:1000) Size: 0x1000 (4096) bytes.
DispatchHeap: 000007ff000e0000(26000:1000) Size: 0x1000 (4096) bytes.
CacheEntryHeap: Size: 0x0 (0) bytes.
Total size: Size: 0x8000 (32768) bytes.
--------------------------------------
Shared Domain: 000007fef5095040
LowFrequencyHeap: 000007ff00020000(2000:1000) Size: 0x1000 (4096) bytes.
HighFrequencyHeap: 000007ff00022000(8000:1000) Size: 0x1000 (4096) bytes.
StubHeap: 000007ff0002a000(2000:2000) Size: 0x2000 (8192) bytes.
Virtual Call Stub Heap:
IndcellHeap: 000007ff000d0000(6000:1000) Size: 0x1000 (4096) bytes.
LookupHeap: 000007ff000dc000(4000:1000) Size: 0x1000 (4096) bytes.
ResolveHeap: 000007ff00106000(3a000:1000) Size: 0x1000 (4096) bytes.
DispatchHeap: 000007ff000e0000(26000:1000) Size: 0x1000 (4096) bytes.
CacheEntryHeap: Size: 0x0 (0) bytes.
Total size: Size: 0x8000 (32768) bytes.
--------------------------------------
Domain 1: 00000000003e73c0
LowFrequencyHeap: 000007ff00030000(2000:1000) 000007ff00140000(10000:5000) Size: 0x6000 (24576) bytes total, 0x1000 (4096) bytes wasted.
HighFrequencyHeap: 000007ff00032000(8000:5000) Size: 0x5000 (20480) bytes.
StubHeap: Size: 0x0 (0) bytes.
Virtual Call Stub Heap:
IndcellHeap: 000007ff00040000(4000:1000) Size: 0x1000 (4096) bytes.
LookupHeap: 000007ff0004b000(2000:1000) Size: 0x1000 (4096) bytes.
ResolveHeap: 000007ff0007c000(54000:1000) Size: 0x1000 (4096) bytes.
DispatchHeap: 000007ff0004d000(2f000:1000) Size: 0x1000 (4096) bytes.
CacheEntryHeap: Size: 0x0 (0) bytes.
Total size: Size: 0xf000 (61440) bytes total, 0x1000 (4096) bytes wasted.
--------------------------------------
Jit code heap:
LoaderCodeHeap: 000007ff00150000(40000:2000) Size: 0x2000 (8192) bytes.
Total size: Size: 0x2000 (8192) bytes.
--------------------------------------
Module Thunk heaps:
Module 000007fee5581000: Size: 0x0 (0) bytes.
Module 000007ff000330d8: Size: 0x0 (0) bytes.
Module 000007fef06f1000: Size: 0x0 (0) bytes.
Total size: Size: 0x0 (0) bytes.
--------------------------------------
Module Lookup Table heaps:
Module 000007fee5581000: Size: 0x0 (0) bytes.
Module 000007ff000330d8: Size: 0x0 (0) bytes.
Module 000007fef06f1000: Size: 0x0 (0) bytes.
Total size: Size: 0x0 (0) bytes.
--------------------------------------
Total LoaderHeap size: Size: 0x21000 (135168) bytes total, 0x1000 (4096) bytes wasted.
=======================================

The entry to look at is the Jit code heap near the end of the output. Its LoaderCodeHeap starts at 000007ff00150000, and the JIT’ed code instruction pointers in the stack (0x7ff`00150acf and 0x7ff`0015098c) fall within that range.
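The range check itself is simple arithmetic, sketched below in Python (not part of the original debugging session). The base address and the 0x2000 size come straight from the Jit code heap line of the !EEHeap output. Treating the first number in "(40000:2000)" as the reserved size is an assumption on my part.

```python
# Values taken from the "Jit code heap" line of the !EEHeap output.
JIT_HEAP_BASE = 0x000007FF00150000
JIT_HEAP_COMMITTED = 0x2000   # matches the reported "Size: 0x2000"
JIT_HEAP_RESERVED = 0x40000   # assumed meaning of the first number in "(40000:2000)"

def in_jit_heap(address, size=JIT_HEAP_COMMITTED):
    """Return True if address lies inside [base, base + size)."""
    return JIT_HEAP_BASE <= address < JIT_HEAP_BASE + size

# The two symbol-less return addresses from the stack listing.
for rip in (0x000007FF00150ACF, 0x000007FF0015098C):
    print(hex(rip), in_jit_heap(rip), in_jit_heap(rip, JIT_HEAP_RESERVED))
```

Both addresses pass the check even against the smaller committed size, so the conclusion does not depend on the assumption about the reserved size.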

The second sanity check you can perform is to use the ub variant of the u (unassemble) command to confirm that there is a call instruction just prior to that address, as shown below:

0:012> ub 0x7ff`0015098c
000007ff`0015095e 488b01 mov rax,qword ptr [rcx]
000007ff`00150961 48898424b0000000 mov qword ptr [rsp+0B0h],rax
000007ff`00150969 488b4108 mov rax,qword ptr [rcx+8]
000007ff`0015096d 48898424b8000000 mov qword ptr [rsp+0B8h],rax
000007ff`00150975 4c8d8424b0000000 lea r8,[rsp+0B0h]
000007ff`0015097d 488b5308 mov rdx,qword ptr [rbx+8]
000007ff`00150981 488d8c24c0000000 lea rcx,[rsp+0C0h]
000007ff`00150989 ff5318 call qword ptr [rbx+18h]

So at this point we have verified that we probably have a valid stack. But how does the debugger so effectively walk this stack for us if there is no stack frame pointer? The answer, of course, is that it uses the unwind information.

To explore the answer to that question, let’s focus on a particular frame within the stack such as frame 4 in the stack above. The code at that frame is inside the function clr!CLREvent::WaitEx, and if we pass that to .fnent, we get the following output:

0:012> .fnent clr!CLREvent::WaitEx
Debugger function entry 00000000`04075e40 for:
(000007fe`f48bfdb0) clr!CLREvent::WaitEx | (000007fe`f48bfe3c) clr!CLREvent::Set
Exact matches:
clr!CLREvent::WaitEx = <no type information>

BeginAddress = 00000000`000dfdb0
EndAddress = 00000000`000dfe3c
UnwindInfoAddress = 00000000`007267d8

Unwind info at 000007fe`f4f067d8, 20 bytes
version 1, flags 2, prolog 20, codes a
frame reg 0, frame offs 0
handler routine: clr!_CxxFrameHandler3 (000007fe`f49eadf0), data 7b3f54
00: offs 20, unwind op 4, op info 5 UWOP_SAVE_NONVOL FrameOffset: b0
02: offs 1c, unwind op 4, op info 3 UWOP_SAVE_NONVOL FrameOffset: a8
04: offs f, unwind op 2, op info d UWOP_ALLOC_SMALL
05: offs b, unwind op 0, op info e UWOP_PUSH_NONVOL
06: offs 9, unwind op 0, op info d UWOP_PUSH_NONVOL
07: offs 7, unwind op 0, op info c UWOP_PUSH_NONVOL
08: offs 5, unwind op 0, op info 7 UWOP_PUSH_NONVOL
09: offs 4, unwind op 0, op info 6 UWOP_PUSH_NONVOL

Notice that this output is virtually identical to the information provided by dumpbin using the /UNWINDINFO option.

Two values in this output are particularly interesting. The UnwindInfoAddress value (00000000`007267d8) is a relative virtual address (RVA) to the unwind info that is baked into the PE file by the linker. The address on the “Unwind info at” line (000007fe`f4f067d8) is the actual virtual address of the unwind info and can be computed by adding the module base address shown below to the RVA for UnwindInfoAddress.

0:012> lmnm clr

start end module name

000007fe`f47e0000 000007fe`f5145000 clr
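As a worked example of that addition, here is a small Python sketch (my own illustration, not debugger output). The helper names rva_to_va and fmt are made up for this post. The module base is the start column of the lmnm output, and the RVAs are the ones reported by .fnent.

```python
CLR_BASE = 0x000007FEF47E0000   # "start" column of "lmnm clr"

def rva_to_va(module_base, rva):
    """A virtual address is simply the module base plus the RVA."""
    return module_base + rva

def fmt(address):
    """Format a 64-bit address the way the debugger prints it."""
    return f"{address >> 32:08x}`{address & 0xFFFFFFFF:08x}"

print(fmt(rva_to_va(CLR_BASE, 0x007267D8)))  # UnwindInfoAddress -> 000007fe`f4f067d8
print(fmt(rva_to_va(CLR_BASE, 0x000DFDB0)))  # BeginAddress      -> 000007fe`f48bfdb0
```

The second line shows that the same arithmetic turns BeginAddress into the address .fnent reports for clr!CLREvent::WaitEx.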

By examining the PE header using !dh you can confirm that the unwind information resides in the .rdata section of the module, which I have shown below:

0:012> !dh clr

File Type: DLL
FILE HEADER VALUES
8664 machine (X64)
6 number of sections
4BA21EEB time date stamp Thu Mar 18 07:39:07 2010

<snip>
SECTION HEADER #2
.rdata name
1FC8EC virtual size
67F000 virtual address
1FCA00 size of raw data
67E200 file pointer to raw data
0 file pointer to relocation table
0 file pointer to line numbers
0 number of relocations
0 number of line numbers
40000040 flags
Initialized Data
(no align specified)
Read Only
<snip>
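The claim about .rdata can be checked with the numbers in the section header. This is a minimal Python sketch with a hypothetical helper name, section_contains. The virtual address and virtual size are the values printed by !dh, and both are RVAs relative to the module base.

```python
# Values from SECTION HEADER #2 (.rdata) in the !dh output.
RDATA_VIRTUAL_ADDRESS = 0x67F000
RDATA_VIRTUAL_SIZE = 0x1FC8EC

def section_contains(section_rva, section_size, rva):
    """Return True if rva falls inside [section_rva, section_rva + section_size)."""
    return section_rva <= rva < section_rva + section_size

# The unwind info RVA from .fnent lands inside .rdata ...
print(section_contains(RDATA_VIRTUAL_ADDRESS, RDATA_VIRTUAL_SIZE, 0x007267D8))
# ... while the function's own code (BeginAddress) does not.
print(section_contains(RDATA_VIRTUAL_ADDRESS, RDATA_VIRTUAL_SIZE, 0x000DFDB0))
```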

Using the Unwind Info

Now let’s take a look at the unwind info and compare it to the prolog code of the function with which it is associated. For convenience, I have reprinted the .fnent output for the function:

0:012> .fnent clr!CLREvent::WaitEx
Debugger function entry 00000000`04075e40 for:
(000007fe`f48bfdb0) clr!CLREvent::WaitEx | (000007fe`f48bfe3c) clr!CLREvent::Set
Exact matches:
clr!CLREvent::WaitEx = <no type information>

BeginAddress = 00000000`000dfdb0
EndAddress = 00000000`000dfe3c
UnwindInfoAddress = 00000000`007267d8

Unwind info at 000007fe`f4f067d8, 20 bytes
version 1, flags 2, prolog 20, codes a
frame reg 0, frame offs 0
handler routine: clr!_CxxFrameHandler3 (000007fe`f49eadf0), data 7b3f54
00: offs 20, unwind op 4, op info 5 UWOP_SAVE_NONVOL FrameOffset: b0
02: offs 1c, unwind op 4, op info 3 UWOP_SAVE_NONVOL FrameOffset: a8
04: offs f, unwind op 2, op info d UWOP_ALLOC_SMALL
05: offs b, unwind op 0, op info e UWOP_PUSH_NONVOL
06: offs 9, unwind op 0, op info d UWOP_PUSH_NONVOL
07: offs 7, unwind op 0, op info c UWOP_PUSH_NONVOL
08: offs 5, unwind op 0, op info 7 UWOP_PUSH_NONVOL
09: offs 4, unwind op 0, op info 6 UWOP_PUSH_NONVOL

The prolog value on the “version 1, flags 2, prolog 20, codes a” line tells us that the prolog code for the function is 0x20 bytes in length. Using that information we can dump out the prolog code for the function:

0:012> u clr!CLREvent::WaitEx clr!CLREvent::WaitEx+20
clr!CLREvent::WaitEx:
000007fe`f48bfdb0 488bc4 mov rax,rsp
000007fe`f48bfdb3 56 push rsi
000007fe`f48bfdb4 57 push rdi
000007fe`f48bfdb5 4154 push r12
000007fe`f48bfdb7 4155 push r13
000007fe`f48bfdb9 4156 push r14
000007fe`f48bfdbb 4883ec70 sub rsp,70h
000007fe`f48bfdbf 48c7442440feffffff mov qword ptr [rsp+40h],0FFFFFFFFFFFFFFFEh
000007fe`f48bfdc8 48895810 mov qword ptr [rax+10h],rbx
000007fe`f48bfdcc 48896818 mov qword ptr [rax+18h],rbp

The operations in the unwind info are listed in the reverse order of the operations in the assembly code. Each of the UWOP_PUSH_NONVOL operations in the unwind info maps to a nonvolatile register that is pushed onto the stack for safekeeping in the prolog code (rsi, rdi, r12, r13 and r14). The UWOP_ALLOC_SMALL operation corresponds to the sub rsp,70h instruction, and the two UWOP_SAVE_NONVOL operations correspond to the mov instructions that store rbx and rbp through rax, which holds the value rsp had on entry to the function. Now, let’s take a look at the raw stack and tie all of this information together.

Below is the stack again. The frame we are focusing on is frame 04:

0:012> kn
# Child-SP RetAddr Call Site
00 00000000`04a51e18 000007fe`fd4e10ac ntdll!NtWaitForSingleObject+0xa
01 00000000`04a51e20 000007fe`f48bffc7 KERNELBASE!WaitForSingleObjectEx+0x79
02 00000000`04a51ec0 000007fe`f48bff70 clr!CLREvent::WaitEx+0x170
03 00000000`04a51f00 000007fe`f48bfe23 clr!CLREvent::WaitEx+0xf8
04 00000000`04a51f60 000007fe`f48d51d8 clr!CLREvent::WaitEx+0x5e
05 00000000`04a52000 000007fe`f4995249 clr!SVR::gc_heap::wait_for_gc_done+0x98
06 00000000`04a52030 000007fe`f48aef28 clr!SVR::GCHeap::Alloc+0xb4
07 00000000`04a520a0 000007fe`f48aecc9 clr!FastAllocatePrimitiveArray+0xc5
08 00000000`04a52120 000007fe`f071244c clr!JIT_NewArr1+0x389
09 00000000`04a522f0 000007fe`f07111b5 System_Numerics_ni+0x2244c
0a 00000000`04a52330 000007ff`00150acf System_Numerics_ni+0x211b5
0b 00000000`04a523d0 000007ff`0015098c 0x7ff`00150acf
0c 00000000`04a52580 000007ff`0015098c 0x7ff`0015098c
0d 00000000`04a52730 000007ff`0015098c 0x7ff`0015098c
0e 00000000`04a528e0 000007ff`0015098c 0x7ff`0015098c
0f 00000000`04a52a90 000007ff`0015098c 0x7ff`0015098c
10 00000000`04a52c40 000007ff`0015098c 0x7ff`0015098c
11 00000000`04a52df0 000007ff`0015098c 0x7ff`0015098c
12 00000000`04a52fa0 000007ff`0015098c 0x7ff`0015098c
13 00000000`04a53150 000007ff`0015098c 0x7ff`0015098c

Note: The symbols above look a little weird and may lead you to believe that WaitEx is calling itself recursively, but it is not. It only appears that way because you need the private symbols for clr.dll to be able to see the real function name. Only public symbols are available outside of Microsoft.

 

And below is the raw stack relevant to this frame, with some annotations that I have added:

0:012> dps 00000000`04a51f60-10 L20
00000000`04a51f50 00000000`00000001
00000000`04a51f58 000007fe`f48bfe23 clr!CLREvent::WaitEx+0x5e
00000000`04a51f60 00000000`c0402388
00000000`04a51f68 00000000`c0402500
00000000`04a51f70 000007fe`f48afaf0 clr!SystemNative::ArrayCopy
00000000`04a51f78 00000000`00000182
00000000`04a51f80 00000000`04a521d0
00000000`04a51f88 000007fe`00000001
00000000`04a51f90 00000000`00000057
00000000`04a51f98 00000000`c0402398
00000000`04a51fa0 ffffffff`fffffffe
00000000`04a51fa8 007f0000`04a521d0
00000000`04a51fb0 fffff880`009ca540
00000000`04a51fb8 000007fe`f483da5b clr!SVR::heap_select::select_heap+0x1c
00000000`04a51fc0 fffff880`009ca540
00000000`04a51fc8 000007fe`fd4e18aa KERNELBASE!ResetEvent+0xa
00000000`04a51fd0 00000000`0043dc60
00000000`04a51fd8 00000000`00000178
00000000`04a51fe0 00000000`00493c10
00000000`04a51fe8 00000000`0043dc60 <- saved rdi
00000000`04a51ff0 00000000`00000001

*** call into clr!CLREvent::WaitEx ***

00000000`04a51ff8 000007fe`f48d51d8 clr!SVR::gc_heap::wait_for_gc_done+0x98
00000000`04a52000 00000000`00493ba0
00000000`04a52008 00000000`00493ba0 <- saved rbx
00000000`04a52010 00000000`00000058 <- saved rbp
00000000`04a52018 000007fe`f0711e0f System_Numerics_ni+0x21e0f
00000000`04a52020 00000000`00000178
00000000`04a52028 000007fe`f4995249 clr!SVR::GCHeap::Alloc+0xb4
00000000`04a52030 00000000`0043a140
00000000`04a52038 00000000`0043dc60
00000000`04a52040 00000000`00000000
00000000`04a52048 00000000`04a522e0

In the stack listing, the Child-SP value for this frame (00000000`04a51f60) marks the lowest address of the stack frame, and the remaining slots correlate to the unwind data as follows.

The five slots from 00000000`04a51fd0 through 00000000`04a51ff0 hold the nonvolatile registers that are pushed onto the stack in the prolog code. The 0x70 bytes from 00000000`04a51f60 up to 00000000`04a51fd0 are the stack space reserved for locals and for the register home space allocated for calling subroutines. In the unwind data the stack reservation is represented by a UWOP_ALLOC_SMALL operation. And the slots at 00000000`04a52008 and 00000000`04a52010 hold nonvolatile registers that are stored in the home space of the previous stack frame and represented by UWOP_SAVE_NONVOL operations in the unwind information.

As you can see, we have all of the information we need in the unwind data to determine which slots on the stack are used for what. The only thing we don’t know is the partitioning of the reserved stack space for locals, which is described by the private symbol information for the clr.dll module.
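To show how the slots fall out of the prolog, here is a minimal Python sketch that undoes the pushes and the stack allocation starting from Child-SP. It is an illustration only. The function name frame_layout and the list encoding of the prolog are my own, and the sketch deliberately ignores the two UWOP_SAVE_NONVOL operations because they do not move rsp.

```python
CHILD_SP = 0x04A51F60   # Child-SP of the CLREvent::WaitEx frame (frame 04)

# Prolog operations that move rsp, in execution order, from the disassembly.
PROLOG = [("push", "rsi"), ("push", "rdi"), ("push", "r12"),
          ("push", "r13"), ("push", "r14"), ("alloc", 0x70)]

def frame_layout(child_sp, prolog):
    """Undo the prolog in reverse order, starting from Child-SP."""
    sp = child_sp
    saved = {}
    for kind, arg in reversed(prolog):
        if kind == "alloc":
            sp += arg          # undo "sub rsp, size"
        else:
            saved[arg] = sp    # the register was pushed into this slot
            sp += 8            # undo the push
    return_slot = sp           # the slot holding the return address
    caller_sp = sp + 8         # rsp in the caller after the return
    return saved, return_slot, caller_sp

saved, return_slot, caller_sp = frame_layout(CHILD_SP, PROLOG)
for register, address in saved.items():
    print(register, hex(address))
print("return address slot", hex(return_slot))
print("caller Child-SP", hex(caller_sp))
```

The results line up with the raw stack: rdi is found at 04a51fe8, the return address into wait_for_gc_done sits at 04a51ff8, and the caller's Child-SP of 04a52000 matches frame 05 in the stack listing.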

Tying it all Together

.fnent produces its output directly from parsing the definition of the UNWIND_INFO structure and it even gives you the address of where that structure lives in memory. The UNWIND_INFO structure also contains a variable number of UNWIND_CODE structures. You can find details of the structure definitions for UNWIND_INFO and UNWIND_CODE here. Each parsed line of unwind information in the .fnent output is backed by at least one of these structures. In fact, you can see the correlation between the structure fields for UNWIND_CODE and the data in the .fnent output as shown below:

From UNWIND_CODE:

UBYTE       Offset in prolog
UBYTE: 4    Unwind operation code
UBYTE: 4    Operation info

From .fnent:

05: offs b, unwind op 0, op info e UWOP_PUSH_NONVOL

The meaning of the OpInfo (operation info) field is dependent on the UnwindOp (unwind operation code) field and is spelled out in the documentation for UNWIND_CODE. For example, for the UWOP_PUSH_NONVOL operation shown above, the OpInfo field is an index into the following table, which indicates which nonvolatile register this push is associated with. Note that the values in the table below are in decimal, while the .fnent values are in hex:

0          RAX
1          RCX
2          RDX
3          RBX
4          RSP
5          RBP
6          RSI
7          RDI
8 to 15    R8 to R15

Therefore, the previous line from the .fnent output represents a push operation for the r14 register (05: offs b, unwind op 0, op info e UWOP_PUSH_NONVOL). Looking at the assembly above, we see that the topmost UWOP_PUSH_NONVOL operation in the .fnent output correlates to the last nonvolatile register push in the prolog code (push r14).

Note: Remember, the push operations in the .fnent output are listed in the reverse order of where they are in the actual prolog code. This helps the unwind code easily calculate offsets of where they should live in the stack.
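The decoding described above can be sketched in a few lines of Python. Please treat this as an illustration under stated assumptions rather than a reimplementation of .fnent. The byte string RAW is rebuilt by hand from the .fnent output, not dumped from memory, and it leaves out the handler RVA and handler data that follow the codes. The sketch only handles the three operation types this function uses. The name decode_unwind_info is hypothetical.

```python
import struct

REGISTERS = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi",
             "r8", "r9", "r10", "r11", "r12", "r13", "r14", "r15"]

# Assumption: bytes reconstructed from the .fnent output, not read from memory.
RAW = bytes([
    0x11, 0x20, 0x0A, 0x00,   # version 1 | flags 2 << 3, prolog 0x20, 10 slots
    0x20, 0x54, 0x16, 0x00,   # SAVE_NONVOL rbp, next slot = 0xb0 / 8
    0x1C, 0x34, 0x15, 0x00,   # SAVE_NONVOL rbx, next slot = 0xa8 / 8
    0x0F, 0xD2,               # ALLOC_SMALL, op info 0xd
    0x0B, 0xE0, 0x09, 0xD0, 0x07, 0xC0, 0x05, 0x70, 0x04, 0x60,  # pushes
])

def decode_unwind_info(raw):
    """Decode the header and the three operation types used by this function."""
    header = {"version": raw[0] & 0x7, "flags": raw[0] >> 3,
              "prolog": raw[1], "slots": raw[2]}
    codes, slot = [], 0
    while slot < header["slots"]:
        offset, op_byte = raw[4 + 2 * slot], raw[5 + 2 * slot]
        op, info = op_byte & 0xF, op_byte >> 4
        if op == 0:      # UWOP_PUSH_NONVOL: op info is the register number
            codes.append((slot, offset, "UWOP_PUSH_NONVOL", REGISTERS[info]))
        elif op == 2:    # UWOP_ALLOC_SMALL: size is op info * 8 + 8
            codes.append((slot, offset, "UWOP_ALLOC_SMALL", info * 8 + 8))
        elif op == 4:    # UWOP_SAVE_NONVOL: the next slot holds offset / 8
            (scaled,) = struct.unpack_from("<H", raw, 4 + 2 * (slot + 1))
            codes.append((slot, offset, "UWOP_SAVE_NONVOL",
                          (REGISTERS[info], scaled * 8)))
            slot += 1    # the offset occupies one extra slot
        else:
            raise NotImplementedError(f"unwind op {op} is not handled in this sketch")
        slot += 1
    return header, codes

header, codes = decode_unwind_info(RAW)
for code in codes:
    print(code)
```

Note how the two UWOP_SAVE_NONVOL operations each consume two slots, which is why .fnent numbers its lines 00, 02, 04 and why "codes a" (ten slots) yields only eight operations.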

One thing that you will notice in the x64 calling convention is that once the prolog code has executed, the value of rsp very rarely changes. The Child-SP value in the stack displayed by the k commands is the value of rsp for that frame after the prolog code has executed. So the offsets used to access these nonvolatile registers are applied to the Child-SP value to find where they live on the stack. In a way, the Child-SP value acts like the base pointer we are used to on the x86 platform.

In the .fnent output above, you will also see the following:

00: offs 20, unwind op 4, op info 5 UWOP_SAVE_NONVOL FrameOffset: b0

For UWOP_SAVE_NONVOL, you see that the .fnent output shows us the offset where we can find this register, and the register in question is represented by the OpInfo value that equates to rbp. The offset above is applied to the Child-SP value (00000000`04a51f60 in this case) to produce the address 00000000`04a52010, which indicates that’s where we can find a saved copy of rbp. I have also annotated where it lives in the raw stack output shown previously.

Note: If you’re wondering why rbp is stored in the previous stack frame, check out my previous post on this topic where I describe how in optimized builds, the compiler can use the home space from the previous stack frame to save nonvolatile registers thus saving them with a MOV operation as opposed to a PUSH operation. This is possible because in optimized builds the home space is not necessarily used to store parameters.
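The address arithmetic for the two UWOP_SAVE_NONVOL entries can be double-checked from both directions. The Python sketch below is my own illustration. It computes the slots from Child-SP plus FrameOffset, and then cross-checks them against the prolog instructions, which store through rax (the value of rsp on entry, that is, the return-address slot at 04a51ff8).

```python
CHILD_SP = 0x04A51F60                       # Child-SP of the WaitEx frame
FRAME_OFFSETS = {"rbp": 0xB0, "rbx": 0xA8}  # FrameOffset values from .fnent

def saved_slots(child_sp, frame_offsets):
    """Apply each FrameOffset to Child-SP to locate the saved register."""
    return {reg: child_sp + off for reg, off in frame_offsets.items()}

slots = saved_slots(CHILD_SP, FRAME_OFFSETS)

# Cross-check against the prolog: rax = rsp on entry = return-address slot.
ENTRY_RSP = 0x04A51FF8
assert slots["rbx"] == ENTRY_RSP + 0x10     # mov qword ptr [rax+10h],rbx
assert slots["rbp"] == ENTRY_RSP + 0x18     # mov qword ptr [rax+18h],rbp

for reg, address in slots.items():
    print(reg, hex(address))
```

Both slots sit just above the return address, inside the 32 bytes of home space that belong to the caller's frame.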

So How Does All of This Work for CLR JIT Code?

If you have asked this question, then you are definitely paying attention! As we have shown, the compiler and linker are responsible for placing unwind info in the Portable Executable file at build time. But what about dynamic code that is generated at runtime? Certainly there must be unwind information for dynamically compiled code as well, otherwise there would be no way to walk the stack or unwind the stack after an exception.

As it turns out, APIs exist for this very situation, including RtlAddFunctionTable and RtlInstallFunctionTableCallback. In fact, the CLR uses RtlInstallFunctionTableCallback. The generated unwind information is then rooted in a linked list where the head is at ntdll!RtlpDynamicFunctionTable. The format of the linked list items is undocumented as it is an implementation detail, but using dbghelp.dll you can find the unwind information for a given instruction pointer if you so desire by calling SymFunctionTableAccess64.

In fact, if you want to see the CLR adding dynamic unwind info in action you can run the test code above under the debugger, and then at the initial breakpoint, before the application starts running, set the following breakpoint:

bu ntdll!RtlInstallFunctionTableCallback

When you let the application run you should then end up with a call stack at the breakpoint that looks like the following, which clearly shows the JIT compiler adding the unwind info to the table dynamically:

0:000> kn
# Child-SP RetAddr Call Site
00 00000000`0017dca8 000007fe`f4832cc6 ntdll!RtlInstallFunctionTableCallback
01 00000000`0017dcb0 000007fe`f4831422 clr!InstallEEFunctionTable+0x77
02 00000000`0017df60 000007fe`f4828ca8 clr!StubLinker::EmitUnwindInfo+0x492
03 00000000`0017e050 000007fe`f4832c1a clr!StubLinker::EmitStub+0xe8
04 00000000`0017e0b0 000007fe`f48328e5 clr!StubLinker::LinkInterceptor+0x1ea
05 00000000`0017e160 000007fe`f4831e40 clr!CTPMethodTable::CreateStubForNonVirtualMethod+0xa35
06 00000000`0017e300 000007fe`f4832926 clr!CRemotingServices::GetStubForNonVirtualMethod+0x50
07 00000000`0017e3c0 000007fe`f48223f3 clr!MethodDesc::DoPrestub+0x38b
08 00000000`0017e4d0 000007fe`f47e2d07 clr!PreStubWorker+0x1df
09 00000000`0017e590 000007fe`f48210b4 clr!ThePreStubAMD64+0x87
0a 00000000`0017e660 000007fe`f48211c9 clr!CallDescrWorker+0x84
0b 00000000`0017e6d0 000007fe`f4821245 clr!CallDescrWorkerWithHandler+0xa9
0c 00000000`0017e750 000007fe`f4823cf1 clr!MethodDesc::CallDescr+0x2a1
0d 00000000`0017e9b0 000007fe`f49cdc3d clr!MethodDescCallSite::Call+0x35
0e 00000000`0017e9f0 000007fe`f4999f0d clr!AppDomain::InitializeDomainContext+0x1ac
0f 00000000`0017ebf0 000007fe`f49212a1 clr!SystemDomain::InitializeDefaultDomain+0x13d
10 00000000`0017f0c0 000007fe`f4923dd6 clr!SystemDomain::ExecuteMainMethod+0x191
11 00000000`0017f670 000007fe`f4923cf3 clr!ExecuteEXE+0x43
12 00000000`0017f6d0 000007fe`f49a7365 clr!CorExeMainInternal+0xc4
13 00000000`0017f740 000007fe`f8ad3309 clr!CorExeMain+0x15

But there is one more wrinkle to this picture. We now know that by using RtlInstallFunctionTableCallback the CLR, or any other JIT engine, can register a callback that provides the unwind information at runtime. But how does the debugger access this information? When the debugger is broken into the process or if you are debugging a dump, it cannot execute the callback function registered with RtlInstallFunctionTableCallback.

This is where the sixth and final parameter to RtlInstallFunctionTableCallback comes into play. By providing the OutOfProcessCallbackDll parameter, the CLR is providing a DLL which the debugger can use to effectively parse through the JIT compiler’s unwind information statically. When inspecting which path the CLR passes for OutOfProcessCallbackDll on my machine, I see the following string:

0:000> du /c 80 000007fe`f5916160
000007fe`f5916160 "C:\Windows\Microsoft.NET\Framework64\v4.0.30319\mscordacwks.dll"

So, the debugger uses mscordacwks.dll to statically examine the unwind info while the process is broken in the debugger or while inspecting a dump.

Note: This is one of the many reasons why you must have a complete process dump to effectively post-mortem debug managed applications.

Using the ‘k =’ Command to Dump the Stack

If you look at the documentation for the k command, you’ll see that there is a way to override the base pointer when walking the stack. However, the documentation leaves it a complete mystery as to how to apply this in the x64 world. To demonstrate what I mean, consider the following stack from earlier:

0:012> kn
# Child-SP RetAddr Call Site
00 00000000`04a51e18 000007fe`fd4e10ac ntdll!NtWaitForSingleObject+0xa
01 00000000`04a51e20 000007fe`f48bffc7 KERNELBASE!WaitForSingleObjectEx+0x79
02 00000000`04a51ec0 000007fe`f48bff70 clr!CLREvent::WaitEx+0x170
03 00000000`04a51f00 000007fe`f48bfe23 clr!CLREvent::WaitEx+0xf8
04 00000000`04a51f60 000007fe`f48d51d8 clr!CLREvent::WaitEx+0x5e
05 00000000`04a52000 000007fe`f4995249 clr!SVR::gc_heap::wait_for_gc_done+0x98
06 00000000`04a52030 000007fe`f48aef28 clr!SVR::GCHeap::Alloc+0xb4
07 00000000`04a520a0 000007fe`f48aecc9 clr!FastAllocatePrimitiveArray+0xc5
08 00000000`04a52120 000007fe`f071244c clr!JIT_NewArr1+0x389
09 00000000`04a522f0 000007fe`f07111b5 System_Numerics_ni+0x2244c
0a 00000000`04a52330 000007ff`00150acf System_Numerics_ni+0x211b5
0b 00000000`04a523d0 000007ff`0015098c 0x7ff`00150acf
0c 00000000`04a52580 000007ff`0015098c 0x7ff`0015098c
0d 00000000`04a52730 000007ff`0015098c 0x7ff`0015098c
0e 00000000`04a528e0 000007ff`0015098c 0x7ff`0015098c
0f 00000000`04a52a90 000007ff`0015098c 0x7ff`0015098c
10 00000000`04a52c40 000007ff`0015098c 0x7ff`0015098c
11 00000000`04a52df0 000007ff`0015098c 0x7ff`0015098c
12 00000000`04a52fa0 000007ff`0015098c 0x7ff`0015098c
13 00000000`04a53150 000007ff`0015098c 0x7ff`0015098c

Now, imagine the top of the stack is corrupted, so that the top few frames in this stack dump are unavailable. Furthermore, let’s assume that we identified a frame where the stack starts to look sane again by looking at the raw stack below:

0:012> dps 00000000`04a51e90
00000000`04a51e90 00000000`00000000
00000000`04a51e98 00000000`04a52130
00000000`04a51ea0 00000000`ffffffff
00000000`04a51ea8 00000000`ffffffff
00000000`04a51eb0 00000000`00000108
00000000`04a51eb8 000007fe`f48bffc7 clr!CLREvent::WaitEx+0x170
00000000`04a51ec0 00000000`00000000
00000000`04a51ec8 00000000`00000108
00000000`04a51ed0 000007fe`00000000
00000000`04a51ed8 00000000`00000108
00000000`04a51ee0 ffffffff`fffffffe
00000000`04a51ee8 00000000`00000001
00000000`04a51ef0 00000000`00000000
00000000`04a51ef8 000007fe`f48bff70 clr!CLREvent::WaitEx+0xf8
00000000`04a51f00 00000000`00000000
00000000`04a51f08 00000000`00493ba0

From looking at this stack, we can see the typical pattern of stack frames because the return addresses resolve to symbols of sorts.

To dump out the corrupted stack, here is the undocumented syntax for the x64 platform:

k = <rsp> <rip> <frame_count>

<rsp> is the stack pointer to start with. You want to use the stack pointer that would have been in rsp when that function was active. Remember, typically rsp does not change after the function prolog code completes. Therefore, if you pick the address of the slot immediately following the return address in the raw stack listing (00000000`04a51ec0 here), you should be good.

<rip> should be an instruction pointer from within the function that was executing at the time the <rsp> value above was in play. In this case, the return address stored in the slot just before <rsp> (000007fe`f48bffc7, at 00000000`04a51eb8) comes from that function. This piece of information is critical so that the debugger can find the unwind metadata for the function that was current at this point in the stack. Without it, the debugger cannot walk the stack.

Armed with this information, you can construct a k command to dump the stack starting from this frame as shown below:

0:012> kn = 00000000`04a51ec0 000007fe`f48bffc7 10
# Child-SP RetAddr Call Site
00 00000000`04a51ec0 000007fe`f48bff70 clr!CLREvent::WaitEx+0x170
01 00000000`04a51f00 000007fe`f48bfe23 clr!CLREvent::WaitEx+0xf8
02 00000000`04a51f60 000007fe`f48d51d8 clr!CLREvent::WaitEx+0x5e
03 00000000`04a52000 000007fe`f4995249 clr!SVR::gc_heap::wait_for_gc_done+0x98
04 00000000`04a52030 000007fe`f48aef28 clr!SVR::GCHeap::Alloc+0xb4
05 00000000`04a520a0 000007fe`f48aecc9 clr!FastAllocatePrimitiveArray+0xc5
06 00000000`04a52120 000007fe`f071244c clr!JIT_NewArr1+0x389
07 00000000`04a522f0 000007fe`f07111b5 System_Numerics_ni+0x2244c
08 00000000`04a52330 000007ff`00150acf System_Numerics_ni+0x211b5
09 00000000`04a523d0 000007ff`0015098c 0x7ff`00150acf
0a 00000000`04a52580 000007ff`0015098c 0x7ff`0015098c
0b 00000000`04a52730 000007ff`0015098c 0x7ff`0015098c
0c 00000000`04a528e0 000007ff`0015098c 0x7ff`0015098c
0d 00000000`04a52a90 000007ff`0015098c 0x7ff`0015098c
0e 00000000`04a52c40 000007ff`0015098c 0x7ff`0015098c
0f 00000000`04a52df0 000007ff`0015098c 0x7ff`0015098c

Note: The frame count in the above k expression is required. That is the way the debugger engine distinguishes between this variant of the command (with an overridden rip) and the documented form of k that does not provide an overridden rip.
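If you find yourself doing this often, the command string is easy to build from the one slot you identified in the raw stack. The Python sketch below is a convenience of my own, not a debugger feature. The names fmt and build_k_command are hypothetical, and it assumes the debugger is using its default hex radix for the frame count.

```python
def fmt(address):
    """Format a 64-bit address the way the debugger prints it."""
    return f"{address >> 32:08x}`{address & 0xFFFFFFFF:08x}"

def build_k_command(return_slot, return_address, frame_count=16):
    """Build "kn = <rsp> <rip> <frame_count>" from a return-address slot.

    return_slot is the stack address holding the return address, and
    return_address is the value stored there. rsp is the next slot up.
    """
    rsp = return_slot + 8
    return f"kn = {fmt(rsp)} {fmt(return_address)} {frame_count:x}"

# Slot 04a51eb8 holds 000007fe`f48bffc7 (clr!CLREvent::WaitEx+0x170).
print(build_k_command(0x04A51EB8, 0x000007FEF48BFFC7))
```

With the values from the raw stack above, this prints the same command used earlier: kn = 00000000`04a51ec0 000007fe`f48bffc7 10.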

Conclusion

Since the x64 calling convention does not utilize a base pointer (among other things), we need some extra information to effectively walk the stack. That extra information comes in the form of unwind metadata and is generated by the compiler and linker for static code and baked into the portable executable file. If you happen to code in assembly language, there are various macros that you must use to decorate your assembly code so that the assembler can generate the proper unwind metadata. For dynamically compiled code, that information is instead provided at runtime by registering a callback with the system. Knowing this information is critical if you encounter a corrupted stack and must piece it together manually. In such situations you’ll need to know how to dig out the unwind metadata manually and use it to effectively reconstruct the call stack.

That said, you could spare yourself some effort and use the undocumented variant of the k command described above to dump the stack starting at any frame. :)

Happy debugging everyone!

"The example companies, organizations, products, domain names, e-mail addresses, logos, people, places, and events depicted herein are fictitious. No association with any real company, organization, product, domain name, email address, logo, person, places, or events is intended or should be inferred."

Comments

  • Anonymous
    October 27, 2011
    Very nice! Thanks a lot for this detailed explanation!

  • Anonymous
    October 31, 2012
    nice article. i'm trying to walk the stack from a profiler for .NET using RtlLookFunctionTable and RtlVirtualUnwind. i dont see the entire stack also there is stack corruption . what am i doing worng ? [Although we are unable to provide 1:1 troubleshooting through this blog, if the stack is corrupted it would be expected that standard stack walking mechanisms would not work.  There are several options for support through http://support.microsoft.com/ which may be able to provide more detailed 1:1 assistance.]