Does omitting the frame pointers really have a positive effect on performance and a negative effect on debug-ability?

Patrick · Oct 22, 2012

As I was advised a long time ago, I always build my release executables without frame pointers (which is the default if you compile with /Ox).

However, I have now read in the paper http://research.microsoft.com/apps/pubs/default.aspx?id=81176 that frame pointers don't have much of an effect on performance. So optimizing fully (using /Ox) or optimizing fully while keeping frame pointers (using /Ox /Oy-) shouldn't really make a difference to performance.

Microsoft seems to indicate that adding frame pointers (/Oy-) makes debugging easier, but is this really the case?

I did some experiments and noticed that:

  • in a simple 32-bit test executable (compiled using /Ox /Ob0) omitting frame pointers does increase performance (by about 10%). But this test executable only performs some function calls, nothing else.
  • in my own application adding or removing frame pointers doesn't seem to have a big effect. Adding frame pointers seems to make the application about 5% faster, but that could be within the margin of error.
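
For reference, a call-overhead test of the kind described in the first bullet might look like the sketch below. This is not the actual test program: `add_one`, `run_call_benchmark`, and the use of `clock()` for timing are assumptions made purely for illustration. It sticks to C89 so that Visual Studio 2010 accepts it as C.

```c
#include <time.h>

/* Deliberately tiny function: with inlining disabled (/Ob0) almost all of
   its cost is the call itself plus any prologue and epilogue. */
static unsigned add_one(unsigned x)
{
    return x + 1u;
}

/* Calls add_one `iterations` times through a volatile function pointer so
   the compiler cannot fold the calls away. Returns the final counter and
   stores the elapsed CPU time in seconds in *seconds. */
unsigned run_call_benchmark(unsigned iterations, double *seconds)
{
    unsigned (*volatile fn)(unsigned) = add_one;
    unsigned value = 0;
    unsigned i;
    clock_t start = clock();

    for (i = 0; i < iterations; ++i) {
        value = fn(value);
    }

    *seconds = (double)(clock() - start) / CLOCKS_PER_SEC;
    return value;
}
```

Building this once with /Ox /Ob0 and once with /Ox /Ob0 /Oy- and comparing the reported times for a large iteration count isolates the prologue/epilogue cost, although a result from such a tiny function says little about a real application.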

What is the general advice regarding frame pointers?

  • should they be omitted (/Ox) in a release executable because they really have a positive effect on performance?
  • should they be added (/Ox /Oy-) in a release executable because they improve debug-ability (when debugging with a crash-dump file)?

Using Visual Studio 2010.

Answer

John Dvorak · Oct 22, 2012

Short answer: By omitting the frame pointer,

  • You need to use the stack pointer to access local variables and arguments. The compiler doesn't mind, but if you are coding in assembler, this makes your life slightly harder. Much harder if you don't use macros.
  • You save four bytes (32-bit architecture) of stack space per function call. Unless you are using deep recursion, this isn't much of a win.
  • You save a memory write to cached memory (the stack) and you (theoretically) save a few clock ticks on function entry/exit, but you can increase the code size. Unless your function is doing very little very often (in which case it should be inlined), this shouldn't be noticeable.
  • You free up a general purpose register. If the compiler can utilize the register, it will produce code that is both substantially smaller and potentially faster. But if most of the CPU time is spent talking to the main memory (or even the hard drive), omitting the frame pointer is not going to save you from that.
  • The debugger will lose an easy way to generate the stack trace. The debugger might still be able to generate the stack trace from a different source (such as a PDB file).
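
To put the stack-space point in numbers, here is a minimal sketch. `saved_stack_bytes` is a hypothetical helper, and it assumes the only saving is the one pointer-sized slot that the pushed frame pointer would have occupied in each active frame.

```c
#include <stddef.h>

/* Stack bytes saved by not pushing a frame pointer in each active frame.
   pointer_size is 4 on 32-bit x86, the case discussed here. */
size_t saved_stack_bytes(size_t call_depth, size_t pointer_size)
{
    return call_depth * pointer_size;
}
```

For example, `saved_stack_bytes(10000, 4)` gives 40000: a recursion 10,000 frames deep saves about 40 KB, which is small next to the 1 MB stack that the Visual C++ linker reserves by default.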


Long answer:

The typical function entry and exit is:

PUSH FP   ;push the caller's frame pointer
MOV FP,SP ;store the stack pointer in the frame pointer
SUB SP,xx ;allocate space for local variables et al.
...
LEAVE     ;restore the stack pointer and pop the old frame pointer
RET       ;return from the function

An entry and exit without a frame pointer could look like:

SUB SP,xx ;allocate space for local variables et al.
...
ADD SP,xx ;de-allocate space for local variables et al.
RET       ;return from the function.

You will save two instructions but you also duplicate a literal value so the code doesn't get shorter (quite the opposite), but you might have saved a few clock cycles (or not, if it causes a cache miss in the instruction cache). You did save some space on the stack, though.
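
If you want to see both forms in real compiler output, a small function with stack-allocated locals is enough. The following is only a sketch: `sum_squares` and the file name used in the note after it are made up for illustration, and the exact instructions you get depend on the compiler version and options.

```c
/* Sums the squares of 0..n-1 via a small local buffer, so the compiler has
   to reserve stack space (the SUB SP,xx step in the listings above). */
unsigned sum_squares(unsigned n)
{
    volatile unsigned scratch[8];  /* volatile keeps the buffer in memory */
    unsigned total = 0;
    unsigned i;

    for (i = 0; i < n; ++i) {
        scratch[i % 8] = i * i;
        total += scratch[i % 8];
    }
    return total;
}
```

Assuming the code is saved as `frames.c` (a hypothetical name), compiling a 32-bit build with `cl /c /Ox /FA frames.c` and then with `cl /c /Ox /Oy- /FA frames.c` writes an assembly listing for each variant. With GCC, `-S -O2 -fomit-frame-pointer` and `-S -O2 -fno-omit-frame-pointer` serve the same purpose.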


You do free up a general purpose register. This has only benefits.

In regcall/fastcall, this is one extra register where you can store arguments to your function. Thus, if your function takes seven (on x86; more on most other architectures) or more arguments (including this), the seventh argument still fits into a register. The same, more importantly, applies to local variables as well. Arrays and large objects don't fit into registers (but pointers to them do), but if your function is using seven different local variables (including temporary variables needed to calculate complex expressions), chances are the compiler will be able to produce smaller code. Smaller code means lower instruction cache footprint, which means reduced miss rate and thus even less memory access (but Intel Atom has a 32K instruction cache, meaning that your code will probably fit anyways).

The x86 architecture features the [BX/BP/SI/DI] and [BX/BP + SI/DI] addressing modes. This makes the BP register an extremely useful place for a scaled array index, especially if the array pointer resides in the SI or DI registers. Two offset registers are better than one.

Utilising a register avoids memory access, but if a variable is worth storing in a register, chances are it will survive just as well in the L1 cache (especially since it's going to be on the stack). There is still the cost of moving to/from the cache, but since modern CPUs do a lot of move optimisation and parallelisation, it is possible that an L1 access would be just as fast as a register access. Thus, the speed benefit from not moving data around is still present, but not as enormous. I can easily imagine the CPU avoiding the data cache completely, at least as far as reading is concerned (and writing to cache can be done in parallel).

A register that is utilised is a register that needs preserving. It is not worth storing much in the registers if you are going to push it to the stack anyways before you use it again. In preserve-by-caller calling conventions (such as the one above), this means that registers as persistent storage are not as useful in a function that calls other functions a lot.

Also note that x86 has a separate register space for floating point registers, meaning that floats cannot utilise the BP register without extra data movement instructions anyways. Only integers and memory pointers do.


What you do lose by omitting frame pointers is debuggability. This shows why:

If the code crashes, all the debugger needs to do to generate the stack trace is:

    PUSH FP      ; log the current frame pointer as well
$1: CALL log_FP  ; log the frame pointer currently on stack
    LEAVE        ; pop the frame pointer to get the next one
    CMP [FP+4],0
    JNZ $1       ; until the stack cannot be popped (the return address is some specific value)

If the code crashes without a frame pointer, the debugger might have no way to generate the stack trace, because it might not know how large the current frame is, that is, how far the return address lies from the stack pointer. To find out, it would need to locate the function's entry or exit point. If the debugger doesn't know the frame pointer is not being used, it might even crash itself.
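
The frame-pointer walk above can also be sketched in C on a simulated stack. This is a model, not what any real debugger does: `struct frame` and `walk_frames` are hypothetical names, and the sketch assumes the outermost frame is marked by a NULL saved frame pointer (the pseudo-assembly tests the return address instead).

```c
#include <stddef.h>

/* One frame as laid out by the PUSH FP / MOV FP,SP prologue: the caller's
   saved frame pointer sits at [FP], the return address one slot above it. */
struct frame {
    const struct frame *caller;  /* saved frame pointer of the caller */
    const void *return_address;  /* where execution resumes in the caller */
};

/* Follows the chain of saved frame pointers, storing up to max_frames
   return addresses in out. Stops at a NULL frame pointer (assumed here to
   mark the outermost frame). Returns the number of addresses stored. */
size_t walk_frames(const struct frame *fp, const void **out, size_t max_frames)
{
    size_t n = 0;

    while (fp != NULL && n < max_frames) {
        out[n++] = fp->return_address;
        fp = fp->caller;
    }
    return n;
}
```

Each step needs nothing but the current frame pointer, which is why the trace is so cheap to produce. Without frame pointers there is no such chain to follow, and the walker needs per-function frame sizes from somewhere else, such as the PDB file.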