Using AVX CPU instructions: Poor performance without "/arch:AVX"

Mike · Oct 20, 2011

My C++ code uses SSE, and I now want to extend it to use AVX when it is available. I detect AVX support at run time and call a function that uses AVX instructions. I use Win7 SP1 + VS2010 SP1 and a CPU with AVX.

To use AVX, it is necessary to include this:

#include "immintrin.h"

and then you can use AVX intrinsics such as _mm256_mul_ps, _mm256_add_ps, etc. The problem is that, by default, VS2010 produces code that runs very slowly and emits this warning:

warning C4752: found Intel(R) Advanced Vector Extensions; consider using /arch:AVX

It seems VS2010 does not actually use AVX instructions but emulates them instead. I added /arch:AVX to the compiler options and got good results. But this option tells the compiler to use AVX instructions everywhere it can, so my code may crash on a CPU that does not support AVX!

So the question is: how do I make the VS2010 compiler produce AVX code only where I use AVX intrinsics directly? For SSE this works: I just use SSE intrinsics and the compiler produces SSE code without any option like /arch:SSE. But for AVX it does not work for some reason.

Answer

Mysticial · Oct 20, 2011

The behavior that you are seeing is the result of expensive state-switching.

See page 102 of Agner Fog's manual:

http://www.agner.org/optimize/microarchitecture.pdf

Every time you improperly switch back and forth between SSE and AVX instructions, you pay an extremely high penalty (~70 cycles).

When you compile without /arch:AVX, VS2010 will generate SSE instructions, but will still use AVX wherever you have AVX intrinsics. Therefore, you'll get code that has both SSE and AVX instructions - which will have those state-switching penalties. (VS2010 knows this, so it emits that warning you're seeing.)

Therefore, you should use either all SSE, or all AVX. Specifying /arch:AVX tells the compiler to use all AVX.

It sounds like you're trying to make multiple code paths: one for SSE and one for AVX. For this, I suggest you separate your SSE and AVX code into two different compilation units (one compiled with /arch:AVX and one without). Then link them together and make a dispatcher that chooses between them based on the hardware it's running on.
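To make the dispatcher idea concrete, here is a minimal single-file sketch for x86 only. The names cpu_supports_avx, mul_sse, mul_avx, select_mul and mul are hypothetical, not from the question. In a real project the AVX kernel would sit in its own .cpp file built with /arch:AVX; here it shares a file with the rest only so the example is self-contained, and the AVX_FN macro is an assumption that lets GCC/Clang compile that one function for AVX.

```cpp
#include <cstddef>
#include <immintrin.h>   // SSE and AVX intrinsics; _xgetbv on MSVC
#if defined(_MSC_VER)
#include <intrin.h>      // __cpuid
#else
#include <cpuid.h>       // __get_cpuid (GCC/Clang)
#endif

// GCC/Clang need a per-function opt-in when the file is built without -mavx.
#if defined(__GNUC__)
#define AVX_FN __attribute__((target("avx")))
#else
#define AVX_FN
#endif

// True only if the CPU reports AVX and the OS saves the YMM registers.
static bool cpu_supports_avx() {
#if defined(_MSC_VER)
    int info[4];
    __cpuid(info, 1);
    const unsigned ecx = static_cast<unsigned>(info[2]);
#else
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) return false;
#endif
    const bool osxsave = (ecx & (1u << 27)) != 0;  // OS uses XSAVE/XRSTOR
    const bool avx     = (ecx & (1u << 28)) != 0;  // CPU implements AVX
    if (!osxsave || !avx) return false;
#if defined(_MSC_VER)
    const unsigned long long xcr0 = _xgetbv(0);
#else
    unsigned lo = 0, hi = 0;
    __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
    const unsigned long long xcr0 = (static_cast<unsigned long long>(hi) << 32) | lo;
#endif
    return (xcr0 & 0x6) == 0x6;  // XMM (bit 1) and YMM (bit 2) state enabled
}

// Baseline path: out[i] = a[i] * b[i], four floats per step.
static void mul_sse(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
        _mm_storeu_ps(out + i, _mm_mul_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i)));
    for (; i < n; ++i) out[i] = a[i] * b[i];
}

// AVX path: eight floats per step. Would normally live in a separate
// compilation unit built with /arch:AVX.
AVX_FN static void mul_avx(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8)
        _mm256_storeu_ps(out + i, _mm256_mul_ps(_mm256_loadu_ps(a + i), _mm256_loadu_ps(b + i)));
    _mm256_zeroupper();  // clear upper YMM halves before non-AVX code runs
    for (; i < n; ++i) out[i] = a[i] * b[i];
}

typedef void (*mul_fn)(const float*, const float*, float*, std::size_t);

// Pick the kernel once, based on what the hardware and OS support.
static mul_fn select_mul() {
    return cpu_supports_avx() ? mul_avx : mul_sse;
}

// Public entry point: callers never touch the kernels directly.
void mul(const float* a, const float* b, float* out, std::size_t n) {
    static const mul_fn impl = select_mul();
    impl(a, b, out, n);
}
```

Callers only ever call mul(a, b, out, n), so no AVX instruction executes unless the check passes. The check tests both the CPUID AVX bit and the OS-enabled YMM state, because the CPUID bit alone does not guarantee that the OS preserves the YMM registers.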

If you need to mix SSE and AVX, be sure to use _mm256_zeroupper() or _mm256_zeroall() appropriately to avoid the state-switching penalties.
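As a rough illustration of where such a call might go, the sketch below runs a 256-bit AVX loop, calls _mm256_zeroupper(), and only then uses 128-bit SSE intrinsics for the remainder. The function name add_in_place_avx and the AVX_FN macro are hypothetical. The caller is assumed to have already verified AVX support.

```cpp
#include <cstddef>
#include <immintrin.h>

// GCC/Clang need a per-function opt-in when the file is built without -mavx.
#if defined(__GNUC__)
#define AVX_FN __attribute__((target("avx")))
#else
#define AVX_FN
#endif

// a[i] += b[i]. Must only be called on a CPU/OS that supports AVX.
AVX_FN void add_in_place_avx(float* a, const float* b, std::size_t n) {
    std::size_t i = 0;

    // 256-bit AVX part: eight floats per iteration.
    for (; i + 8 <= n; i += 8) {
        const __m256 va = _mm256_loadu_ps(a + i);
        const __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(a + i, _mm256_add_ps(va, vb));
    }

    // Clear the upper halves of all YMM registers before any SSE code runs,
    // so the SSE instructions below do not trigger the transition penalty.
    _mm256_zeroupper();

    // 128-bit SSE part: four floats per iteration.
    for (; i + 4 <= n; i += 4) {
        const __m128 va = _mm_loadu_ps(a + i);
        const __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(a + i, _mm_add_ps(va, vb));
    }

    // Scalar remainder.
    for (; i < n; ++i) a[i] += b[i];
}
```

The same rule applies at function boundaries: if an AVX routine returns to code compiled without /arch:AVX, calling _mm256_zeroupper() just before returning keeps the caller's SSE code from paying for the switch.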