FPO

I was chatting with one of the perf guys last week and he mentioned something that surprised me greatly.  Apparently he's having perf issues that appear to be associated with a 3rd party driver.  Unfortunately, he's having problems figuring out what's going wrong because the vendor that wrote the driver used FPO (and hasn't provided symbols), so the perf guy can't track down the root cause of the problem.

The reason I was surprised was that I didn't realize that ANYONE was using FPO any more.

What's FPO?

To know the answer, you have to go way back into prehistory.

Intel's 8088 processor had an extremely limited set of registers (I'm ignoring the segment registers).  They were:

AX BX CX DX IP
SI DI BP SP FLAGS

With such a limited set of registers, the registers were all assigned specific purposes.  AX, BX, CX, and DX were the "General Purpose" registers, SI and DI were "Index" registers, SP was the "Stack Pointer", BP was the "Frame Pointer", IP was the "Instruction Pointer", and FLAGS was a read-only register that contained several bits that indicated information about the processor's current state (whether the result of the previous arithmetic or logical instruction was 0, for instance).

The BX, SI, DI and BP registers were special because they could be used as "Index" registers.  Index registers are critically important to a compiler, because they are used to access memory through a pointer.  In other words, if you have a structure that's located at offset 0x1234 in memory, you can set an index register to the value 0x1234 and access values relative to that location.  For example:

MOV    BX, [Structure]
MOV    AX, [BX]+4

This sets the BX register to the value stored in memory at [Structure], and then sets AX to the WORD located 4 bytes past the start of that structure.
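If it helps to see the same two loads outside of assembly, here is a rough Python sketch that simulates them against a flat byte array.  The addresses (0x0100 for the pointer variable, 0x1234 for the structure) and the value 0xBEEF are made up for illustration, and the model ignores segment registers just like the rest of this post.

```python
import struct

# A small flat "memory"; all addresses and values below are made up.
memory = bytearray(0x2000)

STRUCTURE = 0x0100                                  # address of the pointer variable
struct.pack_into("<H", memory, STRUCTURE, 0x1234)   # it points at offset 0x1234
struct.pack_into("<H", memory, 0x1234 + 4, 0xBEEF)  # the field 4 bytes into the structure

# MOV BX, [Structure]: load the 16-bit pointer into the index register.
bx = struct.unpack_from("<H", memory, STRUCTURE)[0]

# MOV AX, [BX]+4: read the little-endian WORD at the address BX + 4.
ax = struct.unpack_from("<H", memory, bx + 4)[0]

print(hex(bx), hex(ax))  # 0x1234 0xbeef
```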

One thing to note is that the SP register wasn't an index register.  That meant that to access variables on the stack, you needed to use a different register.  That's where the BP register came from - the BP register was dedicated to accessing values on the stack.

When the 386 came out, Intel stretched the various registers to 32 bits and lifted the restriction that only BX, SI, DI and BP could be used as index registers.

EAX EBX ECX EDX EIP
ESI EDI EBP ESP EFLAGS

This was a good thing: all of a sudden, instead of being constrained to 3 index registers, the compiler could use 6 of them.

Since index registers are used for structure access, to a compiler they're like gold - more of them is a good thing, and it's worth almost any amount of effort to gain more of them.

Some extraordinarily clever person realized that since ESP was now an index register the EBP register no longer had to be dedicated for accessing variables on the stack.  In other words, instead of:

MyFunction:
    PUSH    EBP
    MOV     EBP, ESP
    SUB     ESP, <LocalVariableStorage>
    MOV     EAX, [EBP+8]
      :
      :
    MOV     ESP, EBP
    POP     EBP
    RETD

to access the 1st parameter on the stack (EBP+0 is the old value of EBP, EBP+4 is the return address), you can instead do:

MyFunction:
    SUB     ESP, <LocalVariableStorage>
    MOV     EAX, [ESP+4+<LocalVariableStorage>]
      :
      :
    ADD     ESP, <LocalVariableStorage>
    RETD

This works GREAT - all of a sudden, EBP can be repurposed and used as another general purpose register!  The compiler folks called this optimization "Frame Pointer Omission", and it went by the acronym FPO.
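To make the offset arithmetic in the two listings concrete, here is a small Python sketch.  The function names and the pushed_bytes parameter are my own inventions for illustration, and it assumes every parameter is a 4-byte DWORD, as in ordinary 32-bit code.

```python
def param_offset_from_ebp(index):
    """Offset of parameter `index` (0-based) from EBP in a framed function."""
    # [EBP+0] is the saved EBP, [EBP+4] is the return address,
    # so parameters start at EBP+8.
    return 8 + 4 * index


def param_offset_from_esp(index, local_bytes, pushed_bytes=0):
    """Offset of parameter `index` from ESP in an FPO function.

    local_bytes  - size of <LocalVariableStorage>
    pushed_bytes - anything pushed since the SUB ESP (hypothetical parameter)
    """
    # [ESP+local_bytes] is the return address; parameters follow it.
    return pushed_bytes + local_bytes + 4 + 4 * index


print(param_offset_from_ebp(0))         # 8
print(param_offset_from_esp(0, 16))     # 20
print(param_offset_from_esp(0, 16, 4))  # 24
```

Notice that the EBP-relative offset never changes, while the ESP-relative offset depends on how much local storage the function reserved and on anything it has pushed since.  The compiler knows those numbers at every instruction; somebody looking at the stack from the outside doesn't.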

But there's one small problem with FPO.

If you look at the pre-FPO example for MyFunction, you'd notice that the first instruction in the routine was PUSH EBP followed by a MOV EBP, ESP.  That had an interesting and extremely useful side effect.  It essentially created a singly linked list that linked the frame pointer for each of the callers to a function.  From the EBP for a routine, you could recover the entire call stack for a function.  This was unbelievably useful for debuggers - it meant that call stacks were quite reliable, even if you didn't have symbols for all the modules being debugged.  Unfortunately, when FPO was enabled, that list of stack frames was lost - the information simply wasn't being tracked.
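The linked list of saved frame pointers can be sketched in a few lines of Python.  This is only a toy model of the idea, not how any particular debugger is implemented: the stack is a dictionary of made-up addresses, the three frames (inner, outer, main) are hypothetical, and I'm assuming the outermost frame stores 0 as its saved EBP to terminate the walk.

```python
def walk_frame_chain(memory, ebp, max_frames=64):
    """Follow saved-EBP links and collect each frame's return address."""
    return_addresses = []
    while ebp != 0 and len(return_addresses) < max_frames:
        return_addresses.append(memory[ebp + 4])  # [EBP+4] = return address
        ebp = memory[ebp]                         # [EBP+0] = caller's EBP
    return return_addresses


# Hypothetical stack for main -> outer -> inner (the stack grows downward).
memory = {
    0x0FC0: 0x0FE0, 0x0FC4: 0x401240,  # inner: outer's EBP, return into outer
    0x0FE0: 0x1000, 0x0FE4: 0x401120,  # outer: main's EBP, return into main
    0x1000: 0x0000, 0x1004: 0x401000,  # main: end of chain (assumed sentinel)
}

stack = walk_frame_chain(memory, ebp=0x0FC0)
print([hex(address) for address in stack])
# ['0x401240', '0x401120', '0x401000']
```

No symbols are involved anywhere in that walk, which is the whole point.  If one of those functions had been compiled with FPO, its frame would hold no saved EBP at a known location and the walk would have nothing to follow.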

To solve this problem, the compiler guys put the information that was lost when FPO was enabled into the PDB file for the binary.  Thus, when you had symbols for the modules, you could recover all the stack information.

FPO was enabled for all Windows binaries in NT 3.51, but was turned off for Windows binaries in Vista because it was no longer necessary - machines had gotten sufficiently faster since 1995 that the performance improvements achieved by FPO no longer outweighed the pain in debugging and analysis that FPO caused.

 

Edit: Clarified what I meant by "FPO was enabled in NT 3.51" and "was turned off in Vista", thanks Steve for pointing this out.

Comments

  • Anonymous
    March 12, 2007
    Everybody who's using default MSVC options is using FPO. As of MSVC 2005 /O1, /O2 and /Ox (I forget which is default for Release) all include /Oy.

  • Anonymous
    March 12, 2007
    So for those of us who have symbols for our modules, is there any debugging advantage to disabling FPO?

  • Anonymous
    March 12, 2007
    If you list an option as an "optimization" and the program doesn't crash, people are going to use it even if they don't know whether it's really beneficial. I'd turn it off permanently if I could turn it on for only specific loops with a pragma, but I haven't seen docs pointing to that possibility.

  • Anonymous
    March 12, 2007
    It's the normal rule: if you think an optimisation may be useful, use the profiler to confirm that it's actually measurable, and significant.

  • Anonymous
    March 12, 2007
    > Some extraordinarily clever person realized that since ESP was now an index register the EBP register no longer had to be dedicated for accessing variables on the stack.
    No, that was just ordinary cleverness.  For comparison, I was using R13 as a base register in addition to pointing to the save area, on IBM 360 and 370 before Intel's 8080 and 8086 existed.  Registers were gold to assembly language coders too.

  • Anonymous
    March 12, 2007
    Disabling FPO can have both serious code size and performance impact. Tail call optimizations have to be disabled when a frame pointer is present, leading to much greater stack usage in affected paths. Small functions are also disproportionately affected by prolog/epilog code. Third, although there are still six registers available with a frame pointer on X86, only three of them are nonvolatile with respect to nested calls: EBX, ESI, and EDI. Opening up a fourth register can drop out a bunch of spill code.
    That having been said, FPO is often minor in impact, mainly because ESP-based addressing takes an extra byte per instruction, and any aligned objects on the stack will force an aligned frame pointer anyway.
    The issue with call stacks is a problem with the Win32 ABI and the debugging information written by VC++, not with FPO itself. Not only does VC++ not write enough information to always crawl past FPO functions, the __stdcall calling convention makes it impossible to statically determine ESP offsets since it is caller-pops. On other compilers or platforms, either the calling convention and ISA are simple enough that instruction analysis can reliably determine the ESP-to-return-address offset, or the required information is available due to table-based exception handling (X64).

  • Anonymous
    March 12, 2007
    > the __stdcall calling convention makes it impossible to statically determine ESP offsets since it is caller-pops.
    No, isn't it callee-pops?  Called function pops its own arguments from stack when returning.  Accomplished on the x86 by using the "ret x" form of the x86 return instruction.
    http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vclang/html/_core___stdcall.asp

  • Anonymous
    March 13, 2007
    __cdecl is caller-pops. __stdcall is callee-pops. (caller-pops is needed to support variadic functions like the printf family).

  • Anonymous
    March 13, 2007
    Is it just not worthwhile to add a feature to your debugger to walk through the stack effects of taking the next return from the function??  I suppose this wouldn't work in all cases, but it should work in many. This strategy might not work right if you have a really messed up stack, but the code should give you enough information to pop the correct amount from the stack since the processor needs to do it correctly too!  With some added heuristics about return addresses, this could even be made to work quite reliably. Maybe this is done already in windbg and I'm just being an idiot.

  • Anonymous
    March 13, 2007
    How was this turned off in Vista?  Isn't FPO done at executable compile time?  Do you just mean this was not used by OS code in Vista or that Vista somehow stops applications from using it?

  • Anonymous
    March 14, 2007
    FPO is pretty much the standard out there in the world today; I'm surprised that you are surprised about it :) Microsoft is the exception in that it does things like turn off FPO for DLLs and rebase its modules and all sorts of other little things that virtually nobody else in the world really thinks to do.  Of course, the fact that VS still continues to default to enabling FPO doesn't really help the situation either.
    In any case, though, it's not too hard to do a manual stack trace in the absence of frame pointers; see the 'dds' command in WinDbg (for instance, 'dds @esp').  That and a little bit of disassembly make it quite doable to manually construct a call stack where there are no frame pointers involved.  More work, yes, but not a blocker.
    The places where FPO really begins to be a problem are where you have automated processes that take snapshots of stacks for debugging purposes (e.g. page heap, or handle tracing).  There, you really lose out because those features can't handle FPO frames on x86 and you lose the ability to easily track down heap corruption or handle leak style problems in many circumstances if a function wrapping a handle or memory allocation API has FPO enabled.
    Windows on x64 makes this a nonissue with the stringent requirements on calling conventions (and unwind metadata) that allow for perfect unwinds in all circumstances without symbols.

  • Anonymous
    March 14, 2007
    Did the syntax really change that much from 16-bit to 32-bit? Perhaps the following patch makes it more correct?

    - MOV    AX, [BX]+4
    + MOV    AX, [BX+4]

  • Anonymous
    March 15, 2007
    Ok I see now, you're talking about static analysis at runtime to try to determine return addresses.  (I first thought you're talking about the compiler.) I guess __stdcall would be a pretty common roadblock to the static analysis you described, though there are other difficulties that could arise, such as encountering an unconditional indirect jump.
    >> I'm not even sure this is resolvable even with debug information, because if the preceding call is polymorphic the call stack may not be enough to determine which function is executing. >>
    Well at least in C++, even if the call is polymorphic, I think the function prototype of the method in the class declaration would ensure that the number of parameters (which has to be fixed for __stdcall) and the calling convention cannot vary amongst the possible different classes.

  • Anonymous
    March 26, 2007
    foxyshadis: OPTION PROLOGUE:NONE OPTION EPILOGUE:NONE before your procedure and OPTION PROLOGUE:PrologueDef OPTION EPILOGUE:EpilogueDef after your procedure is the way to do this in MASM.  So I'm pretty sure VC++ exposes some pragma to do the same in C/C++.

  • Anonymous
    March 30, 2007
    > especially on AMD64 where we have eight more general-purpose registers to use.
    I would like to clear up a misconception here -- those registers come with a cost. Each instruction referencing them has an extra byte prefix hampering instruction decoding throughput. It is even advisable to use 32-bit parts instead of full registers whenever possible. Another thing to consider is that the hardware register renamer has been pretty efficient, so having more (developer and compiler) visible registers doesn't mean much for x86. I wrote a function in assembler (highly optimized piece of code if I might add) which uses EBP as a general purpose register and [ESP + offset] to access parameters and local variables on the stack. It does pay off, but not unless you are sorely missing a register like I did.

  • Anonymous
    May 09, 2007
    That is exactly what FPO is.

  • Anonymous
    May 15, 2007
    Windows symbols are used for two purposes, as far as I know. One is providing FPO data for call stacks, the other is providing names for the symbols themselves. Does this mean that pdb file sizes for Vista are reduced a bit comparatively, disregarding the additional code and functionality? Right or wrong?