Is the SSE unaligned load intrinsic any slower than the aligned load intrinsic on x86-64 Intel CPUs?

Robert T. McGibbon · Nov 28, 2013 · Viewed 7.6k times

I'm considering changing some high-performance code that currently requires 16-byte-aligned arrays and uses _mm_load_ps, so that it relaxes the alignment constraint and uses _mm_loadu_ps instead. There are a lot of myths about the performance implications of memory alignment for SSE instructions, so I made a small test case of what should be a memory-bandwidth-bound loop. Using either the aligned or the unaligned load intrinsic, it runs 100 iterations over a large array, summing the elements with SSE intrinsics. The source code is here: https://gist.github.com/rmcgibbo/7689820

The results on a 64-bit MacBook Pro with a Sandy Bridge Core i5 are below. Lower numbers indicate faster performance. As I read the results, I see basically no performance penalty from using _mm_loadu_ps on unaligned memory.

I find this surprising. Is this a fair test / justified conclusion? On what hardware platforms is there a difference?

$ gcc -O3 -msse aligned_vs_unaligned_load.c  && ./a.out  200000000
Array Size: 762.939 MB
Trial 1
_mm_load_ps with aligned memory:    0.175311
_mm_loadu_ps with aligned memory:   0.169709
_mm_loadu_ps with unaligned memory: 0.169904
Trial 2
_mm_load_ps with aligned memory:    0.169025
_mm_loadu_ps with aligned memory:   0.191656
_mm_loadu_ps with unaligned memory: 0.177688
Trial 3
_mm_load_ps with aligned memory:    0.182507
_mm_loadu_ps with aligned memory:   0.175914
_mm_loadu_ps with unaligned memory: 0.173419
Trial 4
_mm_load_ps with aligned memory:    0.181997
_mm_loadu_ps with aligned memory:   0.172688
_mm_loadu_ps with unaligned memory: 0.179133
Trial 5
_mm_load_ps with aligned memory:    0.180817
_mm_loadu_ps with aligned memory:   0.172168
_mm_loadu_ps with unaligned memory: 0.181852

Answer

creichen · Nov 28, 2013

You have a lot of noise in your results. I re-ran your benchmark on a Xeon E3-1230 V2 @ 3.30 GHz running Debian 7, doing 12 runs (discarding the first to account for virtual-memory noise) over a 200000000-element array, with 10 iterations for the i within the benchmark functions, explicit noinline on the functions you provided, and each of your three benchmarks running in isolation: https://gist.github.com/creichen/7690369

This was with gcc 4.7.2.

The noinline ensured that the first benchmark wasn't optimised out.

The exact invocation was

./a.out 200000000 10 12 $n

for $n from 0 to 2.

Here are the results:

load_ps aligned

min:    0.040655
median: 0.040656
max:    0.040658

loadu_ps aligned

min:    0.040653
median: 0.040655
max:    0.040657

loadu_ps unaligned

min:    0.042349
median: 0.042351
max:    0.042352

As you can see, the bounds are very tight and show that loadu_ps is slower on unaligned access (a slowdown of about 5%) but not on aligned access. Clearly, on that particular machine, loadu_ps pays no penalty for aligned memory access.

Looking at the assembly, the only differences between the load_ps and loadu_ps versions are that the latter includes a movups instruction, re-orders some other instructions to compensate, and uses slightly different register names. The register renaming is probably completely irrelevant, and the movups can get optimised out during microcode translation.

Now, it's hard to tell (without being an Intel engineer with access to more detailed information) whether and how the movups instruction gets optimised out, but considering that the CPU silicon would pay little penalty for simply using the aligned data path when the lower bits of the load address are zero, and the unaligned data path otherwise, that seems plausible to me.

I tried the same on my Core i7 laptop and got very similar results.

In conclusion, I would say that yes, you do pay a penalty for unaligned memory access, but it is small enough to get swamped by other effects. In the runs you reported, there seems to be enough noise to allow for the hypothesis that it is slower for you too (note that you should ignore the first run, since your very first trial will pay a price for warming up the page tables and caches).