How to force gcc to use all SSE (or AVX) registers?

Norbert P. picture Norbert P. · May 11, 2011 · Viewed 7.1k times · Source

I'm trying to write some computationally intensive code for Windows x64 target, with SSE or the new AVX instructions, compiling in GCC 4.5.2 and 4.6.1, MinGW64 (TDM GCC build, and some custom build). My compiler options are -O3 -mavx. (-m64 is implied)

In short, I want to perform some lengthy computation on 4 3D vectors of packed floats. That requires 4x3=12 xmm or ymm registers for storage, and 2 or 3 registers for temporary results. This should IMHO fit snugly in the 16 available SSE (or AVX) registers available for 64bit targets. However, GCC produces a very suboptimal code with register spilling, using only registers xmm0-xmm10 and shuffling data from and onto the stack. My question is:

Is there a way to convince GCC to use all the registers xmm0-xmm15?

To fix ideas, consider the following SSE code (for illustration only):

void example(vect<__m128> q1, vect<__m128> q2, vect<__m128>& a1, vect<__m128>& a2) {
    for (int i=0; i < 10; i++) {
        vect<__m128> v = q2 - q1;
        a1 += v;
//      a2 -= v;

        q2 *= _mm_set1_ps(2.);
    }
}

Here vect<__m128> is simply a struct of 3 __m128, with natural addition and multiplication by scalar. When the line a2 -= v is commented out, i.e. we need only 3x3 registers for storage since we are ignoring a2, the produced code is indeed straightforward with no moves, everything is performed in registers xmm0-xmm10. When I remove the comment a2 -= v, the code is pretty awful with a lot of shuffling between registers and stack. Even though the compiler could just use registers xmm11-xmm13 or something.

I actually haven't seen GCC use any of the registers xmm11-xmm15 anywhere in all my code yet. What am I doing wrong? I understand that they are callee-saved registers, but this overhead is completely justified by simplifying the loop code.

Answer

jalf picture jalf · May 11, 2011

Two points:

  • First, You're making a lot of assumptions. Register spilling is pretty cheap on x86 CPUs (due to fast L1 caches and register shadowing and other tricks), and the 64-bit only registers are more costly to access (in terms of larger instructions), so it may just be that GCC's version is as fast, or faster, than the one you want.
  • Second, GCC, like any compiler, does the best register allocation it can. There's no "please do better register allocation" option, because if there was, it'd always be enabled. The compiler isn't trying to spite you. (Register allocation is a NP-complete problem, as I recall, so the compiler will never be able to generate a perfect solution. The best it can do is to approximate)

So, if you want better register allocation, you basically have two options:

  • write a better register allocator, and patch it into GCC, or
  • bypass GCC and rewrite the function in assembly, so you can control exactly which registers are used when.