I want to multiply a __m128i holding 16 unsigned 8-bit integers using SSE4, but I could only find an intrinsic for multiplying 16-bit integers. Is there nothing such as _mm_mult_epi8?
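The 16-bit intrinsic I did find, for reference (a minimal SSE2 sketch; the `mul_u16x8` array helper is just for illustration):

```c
#include <emmintrin.h>  // SSE2
#include <stdint.h>
#include <string.h>

// multiply two arrays of eight uint16_t lane by lane, keeping the low 16 bits
static void mul_u16x8(const uint16_t *a, const uint16_t *b, uint16_t *out)
{
    __m128i va, vb;
    memcpy(&va, a, 16);
    memcpy(&vb, b, 16);
    __m128i prod = _mm_mullo_epi16(va, vb);  // low half of each 16-bit product
    memcpy(out, &prod, 16);
}
```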
A (potentially) faster way than Marat's solution, based on Agner Fog's approach:
Instead of splitting into high/low halves, split into odd/even lanes. This has the added benefit of working with pure SSE2 instead of requiring SSE4.1 (of no use to the OP, but a nice added bonus for some). I also added an optimization for AVX2. Technically the AVX2 path uses only SSE2 intrinsics, but without VPBROADCASTW it's slower than the shift-left-then-right repack.
#include <emmintrin.h>  // SSE2 intrinsics

__m128i mullo_epi8(__m128i a, __m128i b)
{
    // unpack and multiply: even lanes in place, odd lanes shifted down
    __m128i dst_even = _mm_mullo_epi16(a, b);
    __m128i dst_odd  = _mm_mullo_epi16(_mm_srli_epi16(a, 8), _mm_srli_epi16(b, 8));
    // repack
#ifdef __AVX2__
    // the mask-and-or repack is only faster if we have access to VPBROADCASTW
    return _mm_or_si128(_mm_slli_epi16(dst_odd, 8),
                        _mm_and_si128(dst_even, _mm_set1_epi16(0xFF)));
#else
    return _mm_or_si128(_mm_slli_epi16(dst_odd, 8),
                        _mm_srli_epi16(_mm_slli_epi16(dst_even, 8), 8));
#endif
}
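If you want to sanity-check it, here is a throwaway harness (the `mul_bytes` wrapper is mine, not part of the solution); byte multiplication wraps modulo 256, the same as a scalar uint8_t multiply:

```c
#include <emmintrin.h>  // SSE2
#include <stdint.h>
#include <string.h>

// same routine as above (SSE2 path), repeated so this snippet compiles standalone
static __m128i mullo_epi8(__m128i a, __m128i b)
{
    __m128i dst_even = _mm_mullo_epi16(a, b);
    __m128i dst_odd  = _mm_mullo_epi16(_mm_srli_epi16(a, 8), _mm_srli_epi16(b, 8));
    return _mm_or_si128(_mm_slli_epi16(dst_odd, 8),
                        _mm_srli_epi16(_mm_slli_epi16(dst_even, 8), 8));
}

// helper: multiply two 16-byte arrays lane by lane
static void mul_bytes(const uint8_t *a, const uint8_t *b, uint8_t *out)
{
    __m128i va, vb;
    memcpy(&va, a, 16);  // memcpy avoids alignment assumptions
    memcpy(&vb, b, 16);
    __m128i vr = mullo_epi8(va, vb);
    memcpy(out, &vr, 16);
}
```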
Agner uses the blendv_epi8 intrinsic for the repack, which requires SSE4.1 support.
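For comparison, that repack looks roughly like this (my reconstruction of the idea, not Agner's exact code; the GCC/Clang target attribute is only there so it builds without -msse4.1):

```c
#include <smmintrin.h>  // SSE4.1 for _mm_blendv_epi8
#include <stdint.h>
#include <string.h>

// odd/even multiply as before, but repacked with a per-byte blend
__attribute__((target("sse4.1")))  // GCC/Clang: emit SSE4.1 for this function only
static __m128i mullo_epi8_sse41(__m128i a, __m128i b)
{
    __m128i dst_even = _mm_mullo_epi16(a, b);
    __m128i dst_odd  = _mm_mullo_epi16(_mm_srli_epi16(a, 8), _mm_srli_epi16(b, 8));
    // the mask has its sign bit set in the low byte of each word, so blendv
    // takes the low byte from dst_even and the high byte from the shifted odds
    __m128i mask = _mm_set1_epi16(0x00FF);
    return _mm_blendv_epi8(_mm_slli_epi16(dst_odd, 8), dst_even, mask);
}

// helper: multiply two 16-byte arrays lane by lane (requires an SSE4.1 CPU)
static void mul_bytes_sse41(const uint8_t *a, const uint8_t *b, uint8_t *out)
{
    __m128i va, vb;
    memcpy(&va, a, 16);
    memcpy(&vb, b, 16);
    __m128i vr = mullo_epi8_sse41(va, vb);
    memcpy(out, &vr, 16);
}
```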
Edit:
Interestingly, after doing more disassembly work on optimized builds, my two code paths compile to exactly the same thing. Example disassembly targeting "ivy-bridge" (AVX):
vpmullw xmm2,xmm0,xmm1
vpsrlw xmm0,xmm0,0x8
vpsrlw xmm1,xmm1,0x8
vpmullw xmm0,xmm0,xmm1
vpsllw xmm0,xmm0,0x8
vpand xmm1,xmm2,XMMWORD PTR [rip+0x281]
vpor xmm0,xmm0,xmm1
It uses the "AVX2-optimized" version with a pre-compiled 128-bit xmm constant. Compiling with only SSE2 support produces similar results (though using SSE2 instructions). I suspect Agner Fog's original solution gets optimized to the same thing (it would be crazy if it didn't). No idea how Marat's original solution compares in an optimized build, but for me, having a single method for all x86 SIMD extension levels from SSE2 up is quite nice.