SSE: Difference between _mm_load/store vs. using direct pointer access

Peter picture Peter · Jun 14, 2012 · Viewed 7.3k times · Source

Suppose I want to add two buffers and store the result. Both buffers are already allocated 16byte aligned. I found two examples how to do that.

The first one is using _mm_load to read the data from the buffer into an SSE register, does the add operation and stores back to the result register. Until now I would have done it like that.

void _add( uint16_t * dst, uint16_t const * src, size_t n )
{
  for( uint16_t const * end( dst + n ); dst != end; dst+=8, src+=8 )
  {
    __m128i _s = _mm_load_si128( (__m128i*) src );
    __m128i _d = _mm_load_si128( (__m128i*) dst );

    _d = _mm_add_epi16( _d, _s );

    _mm_store_si128( (__m128i*) dst, _d );
  }
}

The second example just did the add operations directly on the memory addresses without load/store operation. Both seam to work fine.

void _add( uint16_t * dst, uint16_t const * src, size_t n )
{
  for( uint16_t const * end( dst + n ); dst != end; dst+=8, src+=8 )
  {
    *(__m128i*) dst = _mm_add_epi16( *(__m128i*) dst, *(__m128i*) src );
  }
}

So the question is if the 2nd example is correct or may have any side effects and when to use load/store is mandatory.

Thanks.

Answer

Paul R picture Paul R · Jun 14, 2012

Both versions are fine - if you look at the generated code you will see that the second version still generates at least one load to a vector register, since PADDW (aka _mm_add_epi16) can only get its second argument directly from memory.

In practice most non-trivial SIMD code will do a lot more operations between loading and storing data than just a single add, so in general you probably want to load data initially to vector variables (registers) using _mm_load_XXX, perform all your SIMD operations on registers, then store the results back to memory via _mm_store_XXX.