Efficient 4x4 matrix vector multiplication with SSE: horizontal add and dot product - what's the point?

user2088790 · Feb 19, 2013 · Viewed 10.4k times

I am trying to find the most efficient implementation of 4x4 matrix (M) multiplication with a vector (u) using SSE, i.e. Mu = v.

As far as I understand there are two primary ways to go about this:

    method 1) v1 = dot(row1, u), v2 = dot(row2, u), v3 = dot(row3, u), v4 = dot(row4, u)
    method 2) v = u1*col1 + u2*col2 + u3*col3 + u4*col4

Method 2 is easy to implement in SSE2. Method 1 can be implemented with either the horizontal add instruction in SSE3 or the dot product instruction in SSE4.1. However, in all my tests method 2 always outperforms method 1.

One place where I thought method 1 would have an advantage is with a 3x4 matrix, for example in an affine transform. In that case the last dot product is unnecessary. But even then, method 2 on a 4x4 matrix is faster than method 1 on a 3x4 matrix. The only method I have found that is faster than method 2 on a 4x4 matrix is method 2 on a 4x3 matrix (sketched below).
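
Concretely, the 4x3 variant of method 2 looks something like this (my sketch, using the same headers as the full code below; it drops one shuffle, one multiply, and one add relative to the 4x4 version, and for an affine transform the translation column can then be added in unscaled):

__m128 m4x3v_colSSE(const __m128 cols[3], const __m128 v) {
  __m128 u1 = _mm_shuffle_ps(v, v, _MM_SHUFFLE(0,0,0,0));
  __m128 u2 = _mm_shuffle_ps(v, v, _MM_SHUFFLE(1,1,1,1));
  __m128 u3 = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2,2,2,2));

  __m128 prod1 = _mm_mul_ps(u1, cols[0]);
  __m128 prod2 = _mm_mul_ps(u2, cols[1]);
  __m128 prod3 = _mm_mul_ps(u3, cols[2]);

  // Only two adds are needed for three column products.
  return _mm_add_ps(_mm_add_ps(prod1, prod2), prod3);
}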

So what's the point of the horizontal add and dot product instructions? In fact, the dot product instruction gives the worst performance here. Maybe it has something to do with the format of the data? If one can't dictate how the matrix is ordered, then a transpose is necessary, and in that case maybe method 1 would be better?
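
If the matrix arrives row-major, one option (a sketch) is to transpose with the _MM_TRANSPOSE4_PS macro from <xmmintrin.h> and then run method 2:

__m128 m4x4v_rowMajorViaTranspose(const __m128 rows[4], const __m128 v) {
  __m128 c0 = rows[0], c1 = rows[1], c2 = rows[2], c3 = rows[3];
  _MM_TRANSPOSE4_PS(c0, c1, c2, c3);  // a handful of shuffles; c0..c3 now hold the columns
  const __m128 cols[4] = { c0, c1, c2, c3 };
  return m4x4v_colSSE(cols, v);  // method 2, defined below
}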

Here are my implementations of the three methods:

#include <smmintrin.h>  // SSE4.1; also provides the SSE2/SSE3 intrinsics used below

// Method 2: matrix stored as four columns; broadcast each component of v
// and sum the scaled columns.
__m128 m4x4v_colSSE(const __m128 cols[4], const __m128 v) {
  __m128 u1 = _mm_shuffle_ps(v,v, _MM_SHUFFLE(0,0,0,0));
  __m128 u2 = _mm_shuffle_ps(v,v, _MM_SHUFFLE(1,1,1,1));
  __m128 u3 = _mm_shuffle_ps(v,v, _MM_SHUFFLE(2,2,2,2));
  __m128 u4 = _mm_shuffle_ps(v,v, _MM_SHUFFLE(3,3,3,3));

  __m128 prod1 = _mm_mul_ps(u1, cols[0]);
  __m128 prod2 = _mm_mul_ps(u2, cols[1]);
  __m128 prod3 = _mm_mul_ps(u3, cols[2]);
  __m128 prod4 = _mm_mul_ps(u4, cols[3]);

  return _mm_add_ps(_mm_add_ps(prod1, prod2), _mm_add_ps(prod3, prod4));
}

// Method 1 (SSE3): matrix stored as four rows; the per-row products are
// reduced with two rounds of horizontal adds.
__m128 m4x4v_rowSSE3(const __m128 rows[4], const __m128 v) {
  __m128 prod1 = _mm_mul_ps(rows[0], v);
  __m128 prod2 = _mm_mul_ps(rows[1], v);
  __m128 prod3 = _mm_mul_ps(rows[2], v);
  __m128 prod4 = _mm_mul_ps(rows[3], v);

  return _mm_hadd_ps(_mm_hadd_ps(prod1, prod2), _mm_hadd_ps(prod3, prod4));
}

// Method 1 (SSE4.1): one dot-product instruction per row; mask 0xFF uses all
// four elements and broadcasts each result to every lane.
__m128 m4x4v_rowSSE4(const __m128 rows[4], const __m128 v) {
  __m128 prod1 = _mm_dp_ps(rows[0], v, 0xFF);
  __m128 prod2 = _mm_dp_ps(rows[1], v, 0xFF);
  __m128 prod3 = _mm_dp_ps(rows[2], v, 0xFF);
  __m128 prod4 = _mm_dp_ps(rows[3], v, 0xFF);

  // Gather the four broadcast dot products into a single vector.
  return _mm_shuffle_ps(_mm_movelh_ps(prod1, prod2), _mm_movelh_ps(prod3, prod4), _MM_SHUFFLE(2, 0, 2, 0));
}
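
For completeness, here is a minimal sanity-check harness (the diagonal test matrix is just for illustration; its rows and columns coincide, so the same data feeds all three routines):

#include <stdio.h>

int main(void) {
  // Columns of diag(1, 2, 3, 4).
  __m128 cols[4] = {
    _mm_setr_ps(1.0f, 0.0f, 0.0f, 0.0f),
    _mm_setr_ps(0.0f, 2.0f, 0.0f, 0.0f),
    _mm_setr_ps(0.0f, 0.0f, 3.0f, 0.0f),
    _mm_setr_ps(0.0f, 0.0f, 0.0f, 4.0f)
  };
  __m128 u = _mm_setr_ps(1.0f, 1.0f, 1.0f, 1.0f);

  float out[4];
  _mm_storeu_ps(out, m4x4v_colSSE(cols, u));
  printf("%f %f %f %f\n", out[0], out[1], out[2], out[3]);  // expect 1 2 3 4
  return 0;
}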

Answer

Marat Dukhan · Feb 20, 2013

Horizontal add and dot product instructions are complex: they are decomposed into multiple simpler micro-operations which the processor executes just like simple instructions. The exact decomposition into micro-operations is processor-specific, but on recent Intel processors horizontal add decomposes into 2 SHUFFLE + 1 ADD micro-operations, and dot product decomposes into 1 MUL + 1 SHUFFLE + 2 ADD micro-operations. Besides requiring more micro-operations, these instructions also stress the instruction decoder in the processor pipeline: Intel processors can decode only one such complex instruction per cycle (compared to four simple instructions). On AMD Bulldozer the relative cost of these complex instructions is even higher.
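
As an illustration (this is not literally what the hardware does internally, just the same work written out in simpler SSE instructions), _mm_hadd_ps(a, b) is equivalent to exactly two shuffles and one add:

// hadd(a, b) = [a0+a1, a2+a3, b0+b1, b2+b3] as 2 SHUFFLE + 1 ADD
static inline __m128 hadd_ps_equiv(__m128 a, __m128 b) {
  __m128 even = _mm_shuffle_ps(a, b, _MM_SHUFFLE(2, 0, 2, 0));  // a0 a2 b0 b2
  __m128 odd  = _mm_shuffle_ps(a, b, _MM_SHUFFLE(3, 1, 3, 1));  // a1 a3 b1 b3
  return _mm_add_ps(even, odd);
}

Unlike the real _mm_hadd_ps instruction, these three simple instructions do not bottleneck the decoder, which can handle several simple instructions per cycle.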