Tail call performance on x86

MSIL includes the tail. prefix to be used with call instructions (call, callvirt, calli), and also the specialized form jmp. The tail. prefix is a hint that the stack frame of the current method should be popped from the stack before transferring control to the target of the call. This is important for functional languages like Scheme which rely on recursion. Without tail calls, they would run into StackOverflows on a routine basis.

The CLR implementation of tail calls on x86 is not the most optimal it could be. We have improved the performance of tail calls in Whidbey (V2). Most tail calls are about 50% faster. However, there is still some work to be done. This blog explains why implementing tail calls is not as straight-forward as it might seem.

The CLR has an accurate tracing Garbage collector - as opposed to a conservative one like the Boehm collector. This puts some requirements on the native code compiled from the IL, and also on the tracking information corresponding to it (similar to pdata on RISC platforms). One of the requirements is that the CLR be able to determine exactly where the return address of the method is on the stack. This is required so that the CLR can “hijack” the return address by temporarily overwriting it with the address of one of the hijacking function in the CLR. This is done so the CLR will get control when the method returns, and the CLR can perform a garbage-collection.

This requirement makes it difficult to move the return address around. Moving the return address around is needed while doing a tail call to a method with a different number of stack-based arguments as the return address gets pushed after the stack arguments. The following diagram shows how the stack looks at various points when a method foo is called normally, and it does a tail call to bar which expects a different number of arguments on the stack. Note that the return address needs to be moved by foo before it does the tail call.

 As soon as foo While foo When foo is about to
is called is executing tail call to bar

                      | |
+--------------+
| |
| |
| foo's frame |
| | | |
| | +--------------+
| | | | |return address|
+--------------+ +--------------+ +--------------+
|return address| |return address| | |
+--------------+ +--------------+ | |
| | | | | |
| stack args | | stack args | | stack args |
| of foo | | of foo | | of bar |
| | | | | |
+--------------+ +--------------+ +--------------+
| | | | | |
| | | | | |
| | | | | |
| Frame of | | Frame of | | Frame of |
|foo's caller | |foo's caller | |foo's caller |
| | | | | |
| | | | | |

Furthermore, we tend to be really careful when changing the hijacking code as any bugs introduced in this code are not easily caught, and instead can become GC holes (corrupt GC references). Most programs will seem to work fine, but if a GC is triggered at just the right spot, some poor object reference will get corrupted.

Another (secondary) concern is GC starvation. If two methods recursively tail call to each other, it causes an infinite loop and the hijacking mechanism does not get a chance to kick in. This reduces the opportunity for GC to suspend the thread and do a collection. This could cause other threads to block longer while they wait for GC to free up some heap memory.

We do know how to make this even faster, and would have loved to have fixed this. However, because of schedule pressure and the risk associated, further improvements will have to wait until after Whidbey (V2.0).

FWIW, the jmp IL instruction is faster since the return address does not have to be shuffled in this case. Also, tail calls will be much faster on 64-bit platforms as the arguments for most methods fit in registers (and so the return address does not need to be moved around). In fact, on 64-bit platforms, the CLR tries to do tail calls even if the tail. prefix is not specified. This is because platform-independent IL programs will get the same amount of stack space reserved for them by the OS, even though the 64-bit process will consume stack at a faster rate. Hence, we try to reduce stack usage by doing tail calls.

Comments

  • Anonymous
    January 25, 2005
    Very cool information. I hope you can do more posts on the JIT and so on, as it's an area of interest for me :).

  • Anonymous
    January 25, 2005
    Is it the response to this "bug"?
    http://lab.msdn.microsoft.com/productfeedback/viewfeedback.aspx?feedbackid=58ece8e2-6914-4049-9d1a-2d040b705512

  • Anonymous
    January 25, 2005
    This post is currently just a placeholder. I will add more information later on.

  • Anonymous
    January 25, 2005
    Yes this is basically the response for the bug (http://lab.msdn.microsoft.com/productfeedback/viewfeedback.aspx?feedbackid=58ece8e2-6914-4049-9d1a-2d040b705512)

    The same comments have been added to the bug and now the MSDN team is processing it so you should see an update soon.

  • Anonymous
    January 26, 2005
    The comment has been removed

  • Anonymous
    April 17, 2006
    PingBack from http://mdavey.wordpress.com/2006/04/18/tail-calls/

  • Anonymous
    July 28, 2007
    What's tail recursion If you know its nothing to do with any of your pet's tail then get onto the next

  • Anonymous
    March 31, 2009
    在上文《尾递归与Continuation》里,我们谈到了尾递归的概念和示例,不过有些朋友对于尾递归的功效依然有所怀疑。因此现在,老赵再简单讲解一下尾递归的优化原理,希望能给大家以一定理性认识。

  • Anonymous
    March 31, 2009
    在上文《尾递归与Continuation》里,我们谈到了尾递归的概念和示例,不过有些朋友对于尾递归的功效依然有所怀疑。因此现在,老赵再简单讲解一下尾递归的优化原理,希望能给大家以一定理性认识。

  • Anonymous
    April 09, 2009
    在上文《尾递归与Continuation》里,我们谈到了尾递归的概念和示例,不过有些朋友对于尾递归的功效依然有所怀疑。因此现在,老赵再简单讲解一下尾递归的优化原理,希望能给大家以一定理性认识。

  • Anonymous
    April 15, 2009
    Po Googlovaní toho tailcallu, jsem našel jen pár dalších stěžovatelů si na jejich rychlost. Potenciálně