I'm trying to write some computationally intensive code for Windows x64 target, with SSE or the new AVX instructions, compiling in GCC 4.5.2 and 4.6.1, MinGW64 (TDM GCC build, and some custom build). My compiler options are -O3 -mavx
. (-m64
is implied)
In short, I want to perform some lengthy computation on 4 3D vectors of packed floats. That requires 4x3=12 xmm or ymm registers for storage, and 2 or 3 registers for temporary results. This should IMHO fit snugly in the 16 available SSE (or AVX) registers available for 64bit targets. However, GCC produces a very suboptimal code with register spilling, using only registers xmm0-xmm10
and shuffling data from and onto the stack. My question is:
Is there a way to convince GCC to use all the registers xmm0-xmm15
?
To fix ideas, consider the following SSE code (for illustration only):
void example(vect<__m128> q1, vect<__m128> q2, vect<__m128>& a1, vect<__m128>& a2) {
for (int i=0; i < 10; i++) {
vect<__m128> v = q2 - q1;
a1 += v;
// a2 -= v;
q2 *= _mm_set1_ps(2.);
}
}
Here vect<__m128>
is simply a struct
of 3 __m128
, with natural addition and multiplication by scalar. When the line a2 -= v
is commented out, i.e. we need only 3x3 registers for storage since we are ignoring a2
, the produced code is indeed straightforward with no moves, everything is performed in registers xmm0-xmm10
. When I remove the comment a2 -= v
, the code is pretty awful with a lot of shuffling between registers and stack. Even though the compiler could just use registers xmm11-xmm13
or something.
I actually haven't seen GCC use any of the registers xmm11-xmm15
anywhere in all my code yet. What am I doing wrong? I understand that they are callee-saved registers, but this overhead is completely justified by simplifying the loop code.
Two points:
So, if you want better register allocation, you basically have two options: