How to absolute 2 double or 4 floats using SSE instruction set? (Up to SSE4)

Pharaun picture Pharaun · Apr 1, 2011 · Viewed 10.5k times · Source

Here's the sample C code that I am trying to accelerate using SSE, the two arrays are 3072 element long with doubles, may drop it down to float if i don't need the precision of doubles.

double sum = 0.0;

for(k = 0; k < 3072; k++) {
    sum += fabs(sima[k] - simb[k]);
}

double fp = (1.0 - (sum / (255.0 * 1024.0 * 3.0)));

Anyway my current problem is how to do the fabs step in a SSE register for doubles or float so that I can keep the whole calculation in the SSE registers so that it remains fast and I can parallelize all of the steps by partly unrolling this loop.

Here's some resources I've found fabs() asm or possibly this flipping the sign - SO however the weakness of the second one would need a conditional check.

Answer

Norbert P. picture Norbert P. · May 13, 2011

I suggest using bitwise and with a mask. Positive and negative values have the same representation, only the most significant bit differs, it is 0 for positive values and 1 for negative values, see double precision number format. You can use one of these:

inline __m128 abs_ps(__m128 x) {
    static const __m128 sign_mask = _mm_set1_ps(-0.f); // -0.f = 1 << 31
    return _mm_andnot_ps(sign_mask, x);
}

inline __m128d abs_pd(__m128d x) {
    static const __m128d sign_mask = _mm_set1_pd(-0.); // -0. = 1 << 63
    return _mm_andnot_pd(sign_mask, x); // !sign_mask & x
}

Also, it might be a good idea to unroll the loop to break the loop-carried dependency chain. Since this is a sum of nonnegative values, the order of summation is not important:

double norm(const double* sima, const double* simb) {
__m128d* sima_pd = (__m128d*) sima;
__m128d* simb_pd = (__m128d*) simb;

__m128d sum1 = _mm_setzero_pd();
__m128d sum2 = _mm_setzero_pd();
for(int k = 0; k < 3072/2; k+=2) {
    sum1 += abs_pd(_mm_sub_pd(sima_pd[k], simb_pd[k]));
    sum2 += abs_pd(_mm_sub_pd(sima_pd[k+1], simb_pd[k+1]));
}

__m128d sum = _mm_add_pd(sum1, sum2);
__m128d hsum = _mm_hadd_pd(sum, sum);
return *(double*)&hsum;
}

By unrolling and breaking the dependency (sum1 and sum2 are now independent), you let the processor execute the additions our of order. Since the instruction is pipelined on a modern CPU, the CPU can start working on a new addition before the previous one is finished. Also, bitwise operations are executed on a separate execution unit, the CPU can actually perform it in the same cycle as addition/subtraction. I suggest Agner Fog's optimization manuals.

Finally, I don't recommend using openMP. The loop is too small and the overhead of distribution the job among multiple threads might be bigger than any potential benefit.