How to get data out of AVX registers?

MSalters picture MSalters · Jun 3, 2016 · Viewed 7.9k times · Source

Using MSVC 2013 and AVX 1, I've got 8 floats in a register:

__m256 foo = mm256_fmadd_ps(a,b,c);

Now I want to call inline void print(float) {...} for all 8 floats. It looks like the Intel AVX intrisics would make this rather complicated:

print(_castu32_f32(_mm256_extract_epi32(foo, 0)));
print(_castu32_f32(_mm256_extract_epi32(foo, 1)));
print(_castu32_f32(_mm256_extract_epi32(foo, 2)));
// ...

but MSVC doesn't even have either of these two intrinsics. Sure, I could write back the values to memory and load from there, but I suspect that at assembly level there's no need to spill a register.

Bonus Q: I'd of course like to write

for(int i = 0; i !=8; ++i) 
    print(_castu32_f32(_mm256_extract_epi32(foo, i)))

but MSVC doesn't understand that many intrinsics require loop unrolling. How do I write a loop over the 8x32 floats in __m256 foo?

Answer

Paul R picture Paul R · Jun 3, 2016

Assuming you only have AVX (i.e. no AVX2) then you could do something like this:

float extract_float(const __m128 v, const int i)
{
    float x;
    _MM_EXTRACT_FLOAT(x, v, i);
    return x;
}

void print(const __m128 v)
{
    print(extract_float(v, 0));
    print(extract_float(v, 1));
    print(extract_float(v, 2));
    print(extract_float(v, 3));
}

void print(const __m256 v)
{
    print(_mm256_extractf128_ps(v, 0));
    print(_mm256_extractf128_ps(v, 1));
}

However I think I would probably just use a union:

union U256f {
    __m256 v;
    float a[8];
};

void print(const __m256 v)
{
    const U256f u = { v };

    for (int i = 0; i < 8; ++i)
        print(u.a[i]);
}