Consider a single memory access (a single read or a single write, not read+write) SSE instruction on an x86 CPU. The instruction is accessing 16 bytes (128 bits) of memory and the accessed memory location is aligned to 16 bytes.
The document "Intel® 64 Architecture Memory Ordering White Paper" states that for "Instructions that read or write a quadword (8 bytes) whose address is aligned on an 8 byte boundary" the memory operation appears to execute as a single memory access regardless of memory type.
The question: Do there exist Intel/AMD/etc x86 CPUs which guarantee that reading or writing 16 bytes (128 bits) aligned to a 16 byte boundary executes as a single memory access? Is so, which particular type of CPU is it (Core2/Atom/K8/Phenom/...)? If you provide an answer (yes/no) to this question, please also specify the method that was used to determine the answer - PDF document lookup, brute force testing, math proof, or whatever other method you used to determine the answer.
This question relates to problems such as http://research.swtch.com/2010/02/off-to-races.html
Update:
I created a simple test program in C that you can run on your computers. Please compile and run it on your Phenom, Athlon, Bobcat, Core2, Atom, Sandy Bridge or whatever SSE2-capable CPU you happen to have. Thanks.
// Compile with:
// gcc -o a a.c -pthread -msse2 -std=c99 -Wall -O2
//
// Make sure you have at least two physical CPU cores or hyper-threading.
#include <pthread.h>
#include <emmintrin.h>
#include <stdio.h>
#include <stdint.h>
#include <string.h>
typedef int v4si __attribute__ ((vector_size (16)));
volatile v4si x;
unsigned n1[16] __attribute__((aligned(64)));
unsigned n2[16] __attribute__((aligned(64)));
void* thread1(void *arg) {
for (int i=0; i<100*1000*1000; i++) {
int mask = _mm_movemask_ps((__m128)x);
n1[mask]++;
x = (v4si){0,0,0,0};
}
return NULL;
}
void* thread2(void *arg) {
for (int i=0; i<100*1000*1000; i++) {
int mask = _mm_movemask_ps((__m128)x);
n2[mask]++;
x = (v4si){-1,-1,-1,-1};
}
return NULL;
}
int main() {
// Check memory alignment
if ( (((uintptr_t)&x) & 0x0f) != 0 )
abort();
memset(n1, 0, sizeof(n1));
memset(n2, 0, sizeof(n2));
pthread_t t1, t2;
pthread_create(&t1, NULL, thread1, NULL);
pthread_create(&t2, NULL, thread2, NULL);
pthread_join(t1, NULL);
pthread_join(t2, NULL);
for (unsigned i=0; i<16; i++) {
for (int j=3; j>=0; j--)
printf("%d", (i>>j)&1);
printf(" %10u %10u", n1[i], n2[i]);
if(i>0 && i<0x0f) {
if(n1[i] || n2[i])
printf(" Not a single memory access!");
}
printf("\n");
}
return 0;
}
The CPU I have in my notebook is Core Duo (not Core2). This particular CPU fails the test, it implements 16-byte memory read/writes with a granularity of 8 bytes. The output is:
0000 96905702 10512
0001 0 0
0010 0 0
0011 22 12924 Not a single memory access!
0100 0 0
0101 0 0
0110 0 0
0111 0 0
1000 0 0
1001 0 0
1010 0 0
1011 0 0
1100 3092557 1175 Not a single memory access!
1101 0 0
1110 0 0
1111 1719 99975389
In the Intel® 64 and IA-32 Architectures Developer's Manual: Vol. 3A, which nowadays contains the specifications of the memory ordering white paper you mention, it is said in section 8.2.3.1, as you note yourself, that
The Intel-64 memory ordering model guarantees that, for each of the following memory-access instructions, the constituent memory operation appears to execute as a single memory access: • Instructions that read or write a single byte. • Instructions that read or write a word (2 bytes) whose address is aligned on a 2 byte boundary. • Instructions that read or write a doubleword (4 bytes) whose address is aligned on a 4 byte boundary. • Instructions that read or write a quadword (8 bytes) whose address is aligned on an 8 byte boundary. Any locked instruction (either the XCHG instruction or another read-modify-write instruction with a LOCK prefix) appears to execute as an indivisible and uninterruptible sequence of load(s) followed by store(s) regardless of alignment.
Now, since the above list does NOT contain the same language for double quadword (16 bytes), it follows that the architecture does NOT guarantee that instructions which access 16 bytes of memory are atomic.
That being said, the last paragraph does hint at a way out, namely the CMPXCHG16B instruction with the LOCK prefix. You can use the CPUID instruction to figure out if your processor supports CMPXCHG16B (the "CX16" feature bit).
In the corresponding AMD document, AMD64 Technology AMD64 Architecture Programmer’s Manual Volume 2: System Programming, I can't find similar clear language.
EDIT: Test program results
(Test program modified to increase #iterations by a factor of 10)
On a Xeon X3450 (x86-64):
0000 999998139 1572 0001 0 0 0010 0 0 0011 0 0 0100 0 0 0101 0 0 0110 0 0 0111 0 0 1000 0 0 1001 0 0 1010 0 0 1011 0 0 1100 0 0 1101 0 0 1110 0 0 1111 1861 999998428
On a Xeon 5150 (32-bit):
0000 999243100 283087 0001 0 0 0010 0 0 0011 0 0 0100 0 0 0101 0 0 0110 0 0 0111 0 0 1000 0 0 1001 0 0 1010 0 0 1011 0 0 1100 0 0 1101 0 0 1110 0 0 1111 756900 999716913
On an Opteron 2435 (x86-64):
0000 999995893 1901 0001 0 0 0010 0 0 0011 0 0 0100 0 0 0101 0 0 0110 0 0 0111 0 0 1000 0 0 1001 0 0 1010 0 0 1011 0 0 1100 0 0 1101 0 0 1110 0 0 1111 4107 999998099
Does this mean that Intel and/or AMD guarantee that 16 byte memory accesses are atomic on these machines? IMHO, it does not. It's not in the documentation as guaranteed architectural behavior, and thus one cannot know if on these particular processors 16 byte memory accesses really are atomic or whether the test program merely fails to trigger them for one reason or another. And thus relying on it is dangerous.
EDIT 2: How to make the test program fail
Ha! I managed to make the test program fail. On the same Opteron 2435 as above, with the same binary, but now running it via the "numactl" tool specifying that each thread runs on a separate socket, I got:
0000 999998634 5990 0001 0 0 0010 0 0 0011 0 0 0100 0 0 0101 0 0 0110 0 0 0111 0 0 1000 0 0 1001 0 0 1010 0 0 1011 0 0 1100 0 1 Not a single memory access! 1101 0 0 1110 0 0 1111 1366 999994009
So what does this imply? Well, the Opteron 2435 may, or may not, guarantee that 16-byte memory accesses are atomic for intra-socket accesses, but at least the cache coherency protocol running on the HyperTransport interconnect between the two sockets does not provide such a guarantee.
EDIT 3: ASM for the thread functions, on request of "GJ."
Here's the generated asm for the thread functions for the GCC 4.4 x86-64 version used on the Opteron 2435 system:
.globl thread2
.type thread2, @function
thread2:
.LFB537:
.cfi_startproc
movdqa .LC3(%rip), %xmm1
xorl %eax, %eax
.p2align 5,,24
.p2align 3
.L11:
movaps x(%rip), %xmm0
incl %eax
movaps %xmm1, x(%rip)
movmskps %xmm0, %edx
movslq %edx, %rdx
incl n2(,%rdx,4)
cmpl $1000000000, %eax
jne .L11
xorl %eax, %eax
ret
.cfi_endproc
.LFE537:
.size thread2, .-thread2
.p2align 5,,31
.globl thread1
.type thread1, @function
thread1:
.LFB536:
.cfi_startproc
pxor %xmm1, %xmm1
xorl %eax, %eax
.p2align 5,,24
.p2align 3
.L15:
movaps x(%rip), %xmm0
incl %eax
movaps %xmm1, x(%rip)
movmskps %xmm0, %edx
movslq %edx, %rdx
incl n1(,%rdx,4)
cmpl $1000000000, %eax
jne .L15
xorl %eax, %eax
ret
.cfi_endproc
and for completeness, .LC3 which is the static data containing the (-1, -1, -1, -1) vector used by thread2:
.LC3:
.long -1
.long -1
.long -1
.long -1
.ident "GCC: (GNU) 4.4.4 20100726 (Red Hat 4.4.4-13)"
.section .note.GNU-stack,"",@progbits
Also note that this is AT&T ASM syntax, not the Intel syntax Windows programmers might be more familiar with. Finally, this is with march=native which makes GCC prefer MOVAPS; but it doesn't matter, if I use march=core2 it will use MOVDQA for storing to x, and I can still reproduce the failures.