Suppose I want to add two buffers and store the result. Both buffers are already allocated with 16-byte alignment. I found two examples of how to do that.
The first one uses _mm_load_si128 to read the data from the buffers into SSE registers, performs the add, and stores the result back to memory with _mm_store_si128. Until now I would have done it like that.
// n must be a multiple of 8 (eight uint16_t values per 128-bit vector)
void _add( uint16_t * dst, uint16_t const * src, size_t n )
{
    for( uint16_t const * end( dst + n ); dst != end; dst+=8, src+=8 )
    {
        __m128i _s = _mm_load_si128( (__m128i const*) src ); // aligned load, keeps src const
        __m128i _d = _mm_load_si128( (__m128i const*) dst );
        _d = _mm_add_epi16( _d, _s );
        _mm_store_si128( (__m128i*) dst, _d );
    }
}
The second example performs the add directly on the memory addresses, without explicit load/store intrinsics. Both seem to work fine.
void _add( uint16_t * dst, uint16_t const * src, size_t n )
{
    for( uint16_t const * end( dst + n ); dst != end; dst+=8, src+=8 )
    {
        *(__m128i*) dst = _mm_add_epi16( *(__m128i*) dst, *(__m128i const*) src );
    }
}
So my question is whether the second example is correct or may have side effects, and when using load/store is mandatory.
Thanks.
Both versions are fine. If you look at the generated code you will see that the second version still generates at least one load to a vector register, since PADDW (aka _mm_add_epi16) can take only its second (source) operand directly from memory; the first operand, which also receives the result, must be a register.
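One case where the explicit intrinsics do make a difference is data that is not guaranteed to be 16-byte aligned: dereferencing a __m128i pointer is treated as an aligned access, so the unaligned intrinsics _mm_loadu_si128 and _mm_storeu_si128 are the safe choice there. The following is only a minimal sketch under that assumption; the name add_unaligned is hypothetical and not part of the question's code, and n is assumed to be a multiple of 8.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstddef>
#include <cstdint>

// Hypothetical helper: adds src into dst without assuming any alignment.
// Assumption: n is a multiple of 8 (eight uint16_t values per 128-bit vector).
void add_unaligned( std::uint16_t * dst, std::uint16_t const * src, std::size_t n )
{
    for( std::size_t i = 0; i < n; i += 8 )
    {
        // Unaligned loads: valid for any address.
        __m128i s = _mm_loadu_si128( reinterpret_cast<__m128i const *>( src + i ) );
        __m128i d = _mm_loadu_si128( reinterpret_cast<__m128i const *>( dst + i ) );
        // 16-bit lane-wise add (wraps on overflow), then unaligned store.
        _mm_storeu_si128( reinterpret_cast<__m128i *>( dst + i ), _mm_add_epi16( d, s ) );
    }
}
```

For buffers that are known to be aligned, as in the question, either of the two original versions remains fine.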
In practice, most non-trivial SIMD code performs many more operations between loading and storing data than a single add, so in general you probably want to load data into vector variables (registers) using _mm_load_XXX, perform all your SIMD operations on registers, then store the results back to memory via _mm_store_XXX.
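To illustrate the load / operate / store pattern, here is a rough sketch that loads each vector once, applies two operations in registers (a saturating add followed by a right shift), and stores once. The function name add_sat_halve and the choice of operations are made up for illustration only; both buffers are assumed to be 16-byte aligned and n a multiple of 8.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical example: dst[i] = saturating_add(dst[i], src[i]) / 2.
// Assumptions: dst and src are 16-byte aligned, n is a multiple of 8.
void add_sat_halve( std::uint16_t * dst, std::uint16_t const * src, std::size_t n )
{
    assert( n % 8 == 0 );
    assert( reinterpret_cast<std::uintptr_t>( dst ) % 16 == 0 );
    assert( reinterpret_cast<std::uintptr_t>( src ) % 16 == 0 );

    for( std::size_t i = 0; i < n; i += 8 )
    {
        // Load once into registers.
        __m128i s = _mm_load_si128( reinterpret_cast<__m128i const *>( src + i ) );
        __m128i d = _mm_load_si128( reinterpret_cast<__m128i const *>( dst + i ) );

        // Do all the work on registers.
        d = _mm_adds_epu16( d, s );  // unsigned saturating add, clamps at 65535
        d = _mm_srli_epi16( d, 1 );  // logical shift right by 1 (divide by 2)

        // Store once.
        _mm_store_si128( reinterpret_cast<__m128i *>( dst + i ), d );
    }
}
```

With only a single add, as in the question, the compiler will typically generate much the same code for both styles; the explicit form mainly pays off in readability once there are several operations between the load and the store.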