I want to multiply a __m128i holding 16 unsigned 8-bit integers using SSE4, but I could only find an intrinsic for multiplying 16-bit integers. Is there nothing such as _mm_mult_epi8?
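The 16-bit intrinsic I did find, for reference (a minimal SSE2 sketch; the `mul_u16x8` array helper is just for illustration):

```c
#include <emmintrin.h>  // SSE2
#include <stdint.h>
#include <string.h>

// multiply two arrays of eight uint16_t lane by lane, keeping the low 16 bits
static void mul_u16x8(const uint16_t *a, const uint16_t *b, uint16_t *out)
{
    __m128i va, vb;
    memcpy(&va, a, 16);
    memcpy(&vb, b, 16);
    __m128i prod = _mm_mullo_epi16(va, vb);  // low half of each 16-bit product
    memcpy(out, &prod, 16);
}
```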
A (potentially) faster way than Marat's solution, based on Agner Fog's approach:
Instead of splitting into high/low halves, split into odd/even lanes. This has the added benefit of working with pure SSE2 instead of requiring SSE4.1 (of no use to the OP, but a nice added bonus for some). I also added an optimization for AVX2. Technically the AVX2 path uses only SSE2 intrinsics, but without VPBROADCASTW it's slower than the shift-left-then-right repack.
#include <emmintrin.h>  // SSE2 intrinsics

__m128i mullo_epi8(__m128i a, __m128i b)
{
    // unpack and multiply: even lanes in place, odd lanes shifted down
    __m128i dst_even = _mm_mullo_epi16(a, b);
    __m128i dst_odd  = _mm_mullo_epi16(_mm_srli_epi16(a, 8), _mm_srli_epi16(b, 8));
    // repack
#ifdef __AVX2__
    // the mask-and-or repack is only faster if we have access to VPBROADCASTW
    return _mm_or_si128(_mm_slli_epi16(dst_odd, 8),
                        _mm_and_si128(dst_even, _mm_set1_epi16(0xFF)));
#else
    return _mm_or_si128(_mm_slli_epi16(dst_odd, 8),
                        _mm_srli_epi16(_mm_slli_epi16(dst_even, 8), 8));
#endif
}
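If you want to sanity-check it, here is a throwaway harness (the `mul_bytes` wrapper is mine, not part of the solution); byte multiplication wraps modulo 256, the same as a scalar uint8_t multiply:

```c
#include <emmintrin.h>  // SSE2
#include <stdint.h>
#include <string.h>

// same routine as above (SSE2 path), repeated so this snippet compiles standalone
static __m128i mullo_epi8(__m128i a, __m128i b)
{
    __m128i dst_even = _mm_mullo_epi16(a, b);
    __m128i dst_odd  = _mm_mullo_epi16(_mm_srli_epi16(a, 8), _mm_srli_epi16(b, 8));
    return _mm_or_si128(_mm_slli_epi16(dst_odd, 8),
                        _mm_srli_epi16(_mm_slli_epi16(dst_even, 8), 8));
}

// helper: multiply two 16-byte arrays lane by lane
static void mul_bytes(const uint8_t *a, const uint8_t *b, uint8_t *out)
{
    __m128i va, vb;
    memcpy(&va, a, 16);  // memcpy avoids alignment assumptions
    memcpy(&vb, b, 16);
    __m128i vr = mullo_epi8(va, vb);
    memcpy(out, &vr, 16);
}
```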
Agner uses the blendv_epi8 intrinsic for the repack, which requires SSE4.1 support.
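For comparison, that repack looks roughly like this (my reconstruction of the idea, not Agner's exact code; the GCC/Clang target attribute is only there so it builds without -msse4.1):

```c
#include <smmintrin.h>  // SSE4.1 for _mm_blendv_epi8
#include <stdint.h>
#include <string.h>

// odd/even multiply as before, but repacked with a per-byte blend
__attribute__((target("sse4.1")))  // GCC/Clang: emit SSE4.1 for this function only
static __m128i mullo_epi8_sse41(__m128i a, __m128i b)
{
    __m128i dst_even = _mm_mullo_epi16(a, b);
    __m128i dst_odd  = _mm_mullo_epi16(_mm_srli_epi16(a, 8), _mm_srli_epi16(b, 8));
    // the mask has its sign bit set in the low byte of each word, so blendv
    // takes the low byte from dst_even and the high byte from the shifted odds
    __m128i mask = _mm_set1_epi16(0x00FF);
    return _mm_blendv_epi8(_mm_slli_epi16(dst_odd, 8), dst_even, mask);
}

// helper: multiply two 16-byte arrays lane by lane (requires an SSE4.1 CPU)
static void mul_bytes_sse41(const uint8_t *a, const uint8_t *b, uint8_t *out)
{
    __m128i va, vb;
    memcpy(&va, a, 16);
    memcpy(&vb, b, 16);
    __m128i vr = mullo_epi8_sse41(va, vb);
    memcpy(out, &vr, 16);
}
```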
Edit:
Interestingly, after doing more disassembly work on optimized builds, my two code paths compile to exactly the same thing. Example disassembly targeting "ivy-bridge" (AVX):
vpmullw xmm2,xmm0,xmm1
vpsrlw xmm0,xmm0,0x8
vpsrlw xmm1,xmm1,0x8
vpmullw xmm0,xmm0,xmm1
vpsllw xmm0,xmm0,0x8
vpand xmm1,xmm2,XMMWORD PTR [rip+0x281]
vpor xmm0,xmm0,xmm1
It uses the "AVX2-optimized" version with a pre-compiled 128-bit xmm constant. Compiling with only SSE2 support produces similar results (though using SSE2 instructions). I suspect Agner Fog's original solution gets optimized to the same thing (it would be crazy if it didn't). No idea how Marat's original solution compares in an optimized build, but for me, having a single method for all x86 SIMD extension levels from SSE2 up is quite nice.