Using MSVC 2013 and AVX 1, I've got 8 floats in a register:
__m256 foo = mm256_fmadd_ps(a,b,c);
Now I want to call inline void print(float) {...}
for all 8 floats. It looks like the Intel AVX intrisics would make this rather complicated:
print(_castu32_f32(_mm256_extract_epi32(foo, 0)));
print(_castu32_f32(_mm256_extract_epi32(foo, 1)));
print(_castu32_f32(_mm256_extract_epi32(foo, 2)));
// ...
but MSVC doesn't even have either of these two intrinsics. Sure, I could write back the values to memory and load from there, but I suspect that at assembly level there's no need to spill a register.
Bonus Q: I'd of course like to write
for(int i = 0; i !=8; ++i)
print(_castu32_f32(_mm256_extract_epi32(foo, i)))
but MSVC doesn't understand that many intrinsics require loop unrolling. How do I write a loop over the 8x32 floats in __m256 foo
?
Assuming you only have AVX (i.e. no AVX2) then you could do something like this:
float extract_float(const __m128 v, const int i)
{
float x;
_MM_EXTRACT_FLOAT(x, v, i);
return x;
}
void print(const __m128 v)
{
print(extract_float(v, 0));
print(extract_float(v, 1));
print(extract_float(v, 2));
print(extract_float(v, 3));
}
void print(const __m256 v)
{
print(_mm256_extractf128_ps(v, 0));
print(_mm256_extractf128_ps(v, 1));
}
However I think I would probably just use a union:
union U256f {
__m256 v;
float a[8];
};
void print(const __m256 v)
{
const U256f u = { v };
for (int i = 0; i < 8; ++i)
print(u.a[i]);
}