x64 ABI vs. x86 ABI (aka Calling Conventions for AMD64 & EM64T)

(This is an older post, with some mild cleanup, and fixed links)

Before I start: ABI = Application Binary Interface – this is the spec that describes how to call functions, pass parameters, unwind the stack, handle exceptions, etc. It’s also sometimes called the ‘Calling Convention’.

There is a persistent misconception among people who already have a functioning solution for x86 and are now implementing x64 compilers/code generators, or writing ASM code for x64. People too frequently assume they can just keep most of their code, while changing ESP to RSP, and things should 'just work'. This is fundamentally not true. When initially working on the x64 ABI, it was decided that we wanted to clean up the way that exception handling & general function invocation worked. We had a brand new architecture, and we wanted to cut out all of the legacy junk that continually prevents, or at least overcomplicates, achieving great performance on the x86 platform. With this in mind, x64 was given a single calling convention – no __cdecl / __stdcall / __fastcall / __thiscall mess. There was also a dramatic change in the way that x64 unwinds the stack, compared to how x86 does it.

Unwinding the stack is used in a wide variety of places, including handling exceptions, garbage collection, and displaying the call stack from a debugger. On x86, every function that needs some sort of attention due to an exception must add an element to a thread-global linked list upon entry, and remove it upon exit. For non-exception unwinds, the thing doing the unwind must grok through some nasty meta-data that tries to describe what & when the compiler is setting up the stack frame. This meta-data was only implemented after about 1999, and is primarily supported in debuggers. I’ll be ignoring this junk, and instead focus on how exception handling works, since this is the primary way your code will break if you don’t get this right (undebuggable code is still broken, but you won’t notice until you try to get a stack trace).

So, the x86 thread-global linked list contains a list of structures. Each element contains a function pointer to call in the event of an exception, and then some data that said function will consume. Thus, you’ll see fs:[0] references scattered throughout C++ code: every function that contains an object whose destructor must be invoked if an exception occurs must have one of these things. When your code creates a new object, there is a small bit of code executed to update the data on this thread-global list element. When that object is destroyed, more code is executed. Because of this, x86 can catch hundreds of thousands of exceptions per second, but if you don’t actually take any exceptions, your CPU executes lots of blobs of code that have no purpose except to be sure that the rare exception is handled properly. Finding the first function that needs to handle an exception, or destroy an object, is an O(1) operation: it’s a single lookup of FS:[0]. VERY fast. Unfortunately, that should not be a common scenario. Exceptions should be exceptional, not common!

In addition, this linked list really resides on the stack, thus there is a function pointer sitting right below the return address on your stack! Buffer overruns, anyone? There is now an extra blob of information that indicates what functions are valid to be invoked as exception handlers (link /SAFESEH, if you’re curious), but this is only used if every contribution to your .exe or .dll has this information.
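
To make the cost model concrete, here is a rough, portable sketch of the per-thread list described above. It is only a model: the names (Registration, ScopedRegistration, dispatch, chain_depth) are made up for illustration, a thread_local pointer stands in for FS:[0], and a real x86 registration record carries more state than this.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, simplified stand-in for one element of the x86 chain.
struct Registration {
    Registration* prev;         // next-outer record on this thread
    bool (*handler)(int code);  // returns true if it handles the exception
};

// Stands in for FS:[0]: the head of the current thread's chain.
thread_local Registration* g_head = nullptr;

// The work a protected function pays on every call, exception or not:
// link a record on entry, unlink it on exit.
struct ScopedRegistration {
    Registration record;

    explicit ScopedRegistration(bool (*handler)(int)) : record{g_head, handler} {
        g_head = &record;
    }
    ~ScopedRegistration() {
        assert(g_head == &record);  // records must be removed in LIFO order
        g_head = record.prev;
    }
    ScopedRegistration(const ScopedRegistration&) = delete;
    ScopedRegistration& operator=(const ScopedRegistration&) = delete;
};

// Finding the innermost handler is a single read of the head pointer;
// after that we walk outward until somebody claims the exception.
// Returns how many handlers were consulted, or -1 if nobody handled it.
int dispatch(int code) {
    int visited = 0;
    for (Registration* r = g_head; r != nullptr; r = r->prev) {
        ++visited;
        if (r->handler(code)) {
            return visited;
        }
    }
    return -1;
}

// Number of records currently linked on this thread.
std::size_t chain_depth() {
    std::size_t depth = 0;
    for (Registration* r = g_head; r != nullptr; r = r->prev) {
        ++depth;
    }
    return depth;
}
```

Note that the record lives inside the ScopedRegistration object, i.e. on the stack of the function that created it, which is exactly why a function pointer ends up sitting near the return address.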

With this in mind, on x64 every function has a very strict structure that must be properly described in static data. The prolog is the only place in which you can adjust your stack frame pointer. The prolog can only be up to 255 bytes long. All modifications of your stack frame pointer, as well as all saves of nonvolatile registers (RBX, RBP, RSI, RDI, R12-R15, XMM6-XMM15) must be described in this static data, so that they can be restored correctly if an exception occurs. If your function is missing this static data, and an exception is raised, the thread will be terminated by the OS.
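
Here is a small sketch of why that static description is enough to undo a prolog without running any of the function's own code. Everything in it is a simplified assumption for illustration: the two operation kinds, the UnwindCode and MachineState structs, and the std::map standing in for stack memory are all hypothetical, and the real unwind codes are encoded differently and cover more cases.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical, simplified stand-ins for the real unwind codes.
enum class UnwindOp { PushNonvolatile, AllocStack };

struct UnwindCode {
    UnwindOp op;
    std::uint32_t value;  // register index for a push, byte count for an alloc
};

struct MachineState {
    std::uint64_t rsp;
    std::uint64_t regs[16];  // general-purpose registers by index
};

// Undo a prolog whose steps are listed in the order the prolog executed them.
// `memory` models the stack as address -> 8-byte value.
// On return, state.rsp and the saved registers hold the caller's values,
// and the function returns the address the caller will resume at.
std::uint64_t unwind_prolog(const std::vector<UnwindCode>& prolog_codes,
                            MachineState& state,
                            const std::map<std::uint64_t, std::uint64_t>& memory) {
    // Replay the prolog backwards: the last thing done is the first undone.
    for (auto it = prolog_codes.rbegin(); it != prolog_codes.rend(); ++it) {
        switch (it->op) {
        case UnwindOp::AllocStack:
            state.rsp += it->value;  // undo "sub rsp, N"
            break;
        case UnwindOp::PushNonvolatile:
            assert(it->value < 16);
            state.regs[it->value] = memory.at(state.rsp);  // undo "push reg"
            state.rsp += 8;
            break;
        }
    }
    // With the frame gone, RSP points at the return address.
    const std::uint64_t return_address = memory.at(state.rsp);
    state.rsp += 8;
    return return_address;
}
```

If a prolog adjusts the stack in a way the table does not mention, the replay lands on the wrong slot and reads garbage as the return address, which is the practical reason the prolog rules are so strict.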

In an effort to get stuff into people’s hands, here’s an excerpt that I’ve prepended to the ABI document that has not yet seen the light of day. I’ve updated the links to point you at the sections on MSDN. Rather than posting the whole thing, I've just put in my description, with links back to the slightly older version on MSDN. The current version has some more detail, but nothing you couldn't figure out through, ahem, trial & error...

<snip>

Overview

The in-depth nature of an ABI document doesn’t lend itself to ‘easy reading’. However, a detailed knowledge of the entire ABI is rarely necessary to accomplish most programming tasks. This section is simply a quick overview of the ABI, with pointers to the sections that describe the various aspects in more detail. It also tries to point out particular ‘gotchas’ that must be strictly adhered to, in an effort to minimize the problems encountered.

Calling convention

The x64 Windows ABI is a 4 register ‘fast-call’ calling convention, with stack-backing for those registers. There is a strict one-to-one correspondence between arguments in a function, and the registers for those arguments. Any argument that doesn’t fit in 8 bytes, or is not 1, 2, 4, or 8 bytes, must be passed by reference. There is no attempt to spread a single argument across multiple registers. The x87 register stack is unused. It may be used, but must be considered volatile across function calls. All floating point operations are done using the 16 XMM registers. The arguments are passed in registers RCX, RDX, R8, and R9. If the arguments are float/double, they are passed in XMM0L, XMM1L, XMM2L, and XMM3L. 16 byte arguments are passed by reference. Parameter passing is described in detail at MSDN Library: Dev Tools and Languages: VS2008: VS: VC++: 64-bit Programming: x64 Software Conventions: Calling Convention: Parameter Passing. In addition to these registers, RAX, R10, R11, XMM4, and XMM5 are volatile. All other registers are non-volatile. Register usage is documented in detail at MSDN Library: Dev Tools and Languages: VS2008: VS: VC++: 64-bit Programming: x64 Software Conventions: Register Usage and MSDN Library: Dev Tools and Languages: VS2008: VS: VC++: 64-bit Programming: x64 Software Conventions: Calling Convention: Caller/Callee Saved Registers.
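
As an illustration of the one-to-one correspondence between argument position and register, the following sketch classifies a parameter list according to the rules in the paragraph above. The types and the function name (ArgKind, Arg, Placement, place_arguments) are invented for this example, and it deliberately ignores return values and vararg calls.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical description of one argument for this sketch.
enum class ArgKind { Integer, Floating, Aggregate };

struct Arg {
    ArgKind kind;
    std::size_t size;  // size of the argument's type in bytes
};

struct Placement {
    std::string location;  // register name, or "stack"
    bool by_reference;     // true if a pointer to a copy is passed instead
};

// Apply the positional rules: slot i always maps to the i-th register of
// whichever bank fits the argument, and slots 4 and up go to the stack.
std::vector<Placement> place_arguments(const std::vector<Arg>& args) {
    static const char* const kIntRegs[] = {"RCX", "RDX", "R8", "R9"};
    static const char* const kFloatRegs[] = {"XMM0", "XMM1", "XMM2", "XMM3"};

    std::vector<Placement> result;
    for (std::size_t i = 0; i < args.size(); ++i) {
        const Arg& arg = args[i];
        // Anything that is not exactly 1, 2, 4, or 8 bytes goes by reference.
        const bool fits = arg.size == 1 || arg.size == 2 ||
                          arg.size == 4 || arg.size == 8;
        const bool by_reference = !fits;

        std::string location;
        if (i >= 4) {
            location = "stack";
        } else if (arg.kind == ArgKind::Floating && !by_reference) {
            location = kFloatRegs[i];
        } else {
            location = kIntRegs[i];  // integers, small structs, and pointers
        }
        result.push_back({location, by_reference});
    }
    return result;
}
```

For a hypothetical f(int, double, 12-byte struct, long long, float), this yields RCX, XMM1, R8 (holding a pointer), R9, and the stack: a register slot that an argument does not use is simply skipped, never handed to the next argument.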

The caller is responsible for allocating space for parameters to the callee, and must always allocate sufficient space for the 4 register parameters, even if the callee doesn’t have that many parameters. This aids in the simplicity of supporting K&R C’s unprototyped functions, and ‘vararg’ C/C++ functions. For vararg/unprototyped functions any float values must be duplicated in the corresponding general-purpose register. Any parameters above the first 4 must be stored on the stack, above the backing-store for the first 4, prior to the call. Vararg function details can be found at MSDN Library: Dev Tools and Languages: VS2008: VS: VC++: 64-bit Programming: x64 Software Conventions: Calling Convention: Varargs. Unprototyped function information is detailed at MSDN Library: Dev Tools and Languages: VS2008: VS: VC++: 64-bit Programming: x64 Software Conventions: Calling Convention: Unprototyped Functions.
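
The arithmetic behind that layout is simple enough to sketch. The helper names below are hypothetical; the only facts assumed are the ones stated above (8-byte slots, 4 slots always reserved, extra arguments stored above them) plus the 8-byte return address that the call instruction pushes.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kSlotBytes = 8;      // every argument slot is 8 bytes
constexpr std::size_t kRegisterSlots = 4;  // backing store for RCX, RDX, R8, R9

// Minimum bytes the caller must reserve for outgoing arguments.
constexpr std::size_t outgoing_area_bytes(std::size_t arg_count) {
    return kSlotBytes *
           (arg_count < kRegisterSlots ? kRegisterSlots : arg_count);
}

// Offset of argument slot `index` from RSP at the call instruction,
// before the return address has been pushed.
constexpr std::size_t slot_offset_at_call(std::size_t index) {
    return kSlotBytes * index;
}

// Offset of the same slot from RSP on entry to the callee,
// after the call has pushed the 8-byte return address.
constexpr std::size_t slot_offset_on_entry(std::size_t index) {
    return kSlotBytes + kSlotBytes * index;
}

// Self-check for a hypothetical call with six arguments.
inline void check_six_argument_call() {
    assert(outgoing_area_bytes(6) == 48);
    assert(slot_offset_at_call(4) == 32);   // fifth argument, caller's view
    assert(slot_offset_on_entry(4) == 40);  // fifth argument, callee's view
}
```

These numbers are minimums for the argument area only. The caller's whole frame still has to keep RSP 16-byte aligned at the call, so the real allocation is often rounded up.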

Alignment

Most structures are aligned to their natural alignment. The primary exceptions are the stack pointer and malloc/alloca memory, which are aligned to 16 bytes, in order to aid performance. Alignment above 16 bytes must be done manually, but since 16 bytes is a common alignment size for XMM operations, this should suffice for most code. For more information about structure layout and alignment see MSDN Library: Dev Tools and Languages: VS2008: VS: VC++: 64-bit Programming: x64 Software Conventions: Types and Storage. For information about the stack layout, see MSDN Library: Dev Tools and Languages: VS2008: VS: VC++: 64-bit Programming: x64 Software Conventions: Stack Usage.
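
Doing alignment above 16 bytes "manually" usually means over-allocating and rounding the address up. A minimal sketch of the rounding step follows; align_up is a hypothetical helper name, and the alignment is assumed to be a power of two.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Round `address` up to the next multiple of `alignment`.
// `alignment` must be a nonzero power of two (e.g. 32 or 64).
inline std::uintptr_t align_up(std::uintptr_t address, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    const std::uintptr_t mask = static_cast<std::uintptr_t>(alignment) - 1;
    return (address + mask) & ~mask;
}
```

To get a 64-byte-aligned block out of a 16-byte-aligned allocator, one would request size + 63 bytes, pass the returned address through align_up, and keep the original pointer around for freeing.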

Unwindability

All non-leaf functions [leaf functions being those that neither call a function, nor allocate any stack space themselves] must be annotated with data [referred to as xdata or ehdata, which is pointed to from pdata] that describes to the operating system how to properly unwind them, to recover non-volatile registers. Prologs & epilogs are highly restricted, so that they can be properly described in xdata. The stack pointer must be aligned to 16 bytes, except for leaf functions, in any region of code that isn’t part of an epilog or prolog. For details about the proper structure of function prolog & epilogs, see MSDN Library: Dev Tools and Languages: VS2008: VS: VC++: 64-bit Programming: x64 Software Conventions: Prolog and Epilog. For more information about exception handling, and the exception handling/unwinding pdata & xdata see MSDN Library: Dev Tools and Languages: VS2008: VS: VC++: 64-bit Programming: x64 Software Conventions: Exception Handling (x64).
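
The pdata-to-xdata relationship can be pictured as a sorted table that is searched by instruction address. The sketch below is an illustrative model only: FunctionEntry and find_entry are hypothetical names, loosely inspired by the idea of a begin/end/unwind-info triple per function, and the table is assumed to be sorted by start offset with no overlapping ranges.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified pdata entry: one per non-leaf function.
struct FunctionEntry {
    std::uint32_t begin;        // offset of the function's first byte
    std::uint32_t end;          // offset one past its last byte
    std::uint32_t unwind_info;  // offset of the xdata that describes it
};

// Find the entry whose [begin, end) range contains `pc`.
// Returns nullptr when no entry covers the address, which in this model
// means the code is a leaf function (or is missing its unwind data).
const FunctionEntry* find_entry(const std::vector<FunctionEntry>& table,
                                std::uint32_t pc) {
    // First entry that starts strictly after pc; the candidate is the one
    // just before it.
    auto it = std::upper_bound(
        table.begin(), table.end(), pc,
        [](std::uint32_t value, const FunctionEntry& entry) {
            return value < entry.begin;
        });
    if (it == table.begin()) {
        return nullptr;
    }
    --it;
    return pc < it->end ? &*it : nullptr;
}
```

The lookup is O(log n) in the number of functions rather than O(1), but it costs nothing at all until an exception or a stack walk actually happens, which is the trade the x64 design makes.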

</snip>

Anyway, I hope that helps. I could not believe how long it took to find this stuff on MSDN. For some (lame) reason, blogs are better indexed than MSDN in both MSN search, and Google. Hopefully this entry will make finding this information easier.

Here’s a link to the official x64 ABI documentation, which goes into excruciating detail about this stuff. Sorry if it’s not very readable – I wrote a few parts, along with a few other people, and we’re primarily engineers, not writers. We have had a great UE writer take over this document, and it should be slightly improved when it sees the light of day as part of the VS 2005 documentation, but until then, this is it:

MSDN Library: Dev Tools and Languages: VS2008: VS: VC++: 64-bit Programming: x64 Software Conventions

Hopefully, that link will continue to work for a while. MSDN likes to occasionally restructure the way it references information, just to keep us all on our toes. I've now updated the links 3 times. If they're broken, please ping me and I'll update them!

Comments

  • Anonymous
    March 17, 2005
    Hey, any chance you can nag the doc people to provide an HTML Help or PDF of this specification? Having something offline would be really really great.
  • Anonymous
    March 18, 2005
    That's a very good idea - I'll go talk to our doc guys right now!
  • Anonymous
    March 20, 2005
    Thank you Kevin. :-)
  • Anonymous
    August 03, 2005
    A bit late admittedly... But why, oh why, is there no inline assembly allowed under x64 in C/C++? Is there any hope this might be supported in the future? Much as I love MASM, inline assembly is so much nicer to use.
  • Anonymous
    August 08, 2005
    Very good so far. My problem is that the detailed information leaves out some of the most important details. I am well aware of how structured exception handling works internally in x86 32-bit but no idea what happens in 64-bit.

    The official document is missing a description of where the unwind information is located and how to find the unwind information for a function at some location.

    Also, it would help to describe how exception handling in X64 works in the same detail of the article by Matt Pietrek at:
    http://www.microsoft.com/msj/0197/exception/exception.aspx
  • Anonymous
    February 19, 2006
    The comment has been removed
  • Anonymous
    March 06, 2006
    The comment has been removed
  • Anonymous
    March 07, 2006
    That means that all functions must start with a 2 byte instruction, and must be preceded by 6 bytes of padding.  I'll write a new post about that in the next few days...
  • Anonymous
    March 07, 2006
    BTW - there's no inline assembly because, well, it's a lot of work, and we didn't have a lot of time to do it in (at least it didn't seem that way at first).  Inline assembly also tends to get in the way of the optimizer, and it also tends to let people write code that works most of the time, until they realize that they didn't understand the ABI, and now the code they deployed to customers can't properly handle exceptions, nor can they debug the crash dumps...  We'll probably be adding inline asm support in the future (not the next product release, though).  In the mean time, there is a pretty complete list of intrinsic functions in intrin.h...
  • Anonymous
    March 16, 2006
    The comment has been removed
  • Anonymous
    March 16, 2006
    I forgot that my all time favorite example of this is:

    HRESULT IDirect3DDevice9::SetRenderState(
    D3DRENDERSTATETYPE State,
    DWORD Value
    );

    Because the D3D programmers were too lazy to add another function that took a float, here is what their documentation says for a few render states:

    "Values for the this render state are floating-point values. Because the IDirect3DDevice9::SetRenderState method accepts DWORD values, your application must cast a variable that contains the value, as shown in the following code example.

    pDevice9->SetRenderState(D3DRS_FOGSTART, *((DWORD*) (&fFogStart)));"

    Which assumes sizeof(DWORD) == sizeof(float) and alignof(dword) % alignof(float) == 0. But most importantly this is undefined due to aliasing. The only correct way is to use a memcpy (or equivalent).
  • Anonymous
    April 07, 2006
    Hi Kevin,

    Could you tell anything about _local_unwind routine? From some sources I know that it should be called every time the flow of control leaves guarded block of __try/__finally construct. It takes two arguments and it calls termination handler by itself.

    But I found nothing about this routine in Microsoft documentation! It makes me feel bad :(

    Please, give me any hints about it.
    Thanks!
  • Anonymous
    April 10, 2006
    OK, it seems that _local_unwind just calls RtlUnwindEx with appropriate parameters.
  • Anonymous
    June 30, 2006
    Please add these links to your blog:

    http://msdn.microsoft.com/library/en-us/debug/base/structured_exception_handling_functions.asp

    http://msdn2.microsoft.com/en-us/library/7kcdt6fy.aspx

    This because my fellow workers started using your blog as ultimate source of reality and imho this is dumb.