What is the cost of the atomic operation (any of compare-and-swap or atomic add/decrement)? How much cycles does it consume? Will it pause other processors on SMP or NUMA, or will it block memory accesses? Will it flush reorder buffer in out-of-order CPU?
What effects will be on the cache?
I'm interested in modern, popular CPUs: x86, x86_64, PowerPC, SPARC, Itanium.
I have looked for actual data for the past days, and found nothing. However, I did some research, which compares the cost of atomic ops with the costs of cache misses.
The cost of the x86 LOCK prefix, (including lock cmpxchg
for atomic CAS), before PentiumPro (as described in the doc), is a memory access (like a cache miss), + stopping memory operations by other processors, + any contention with other processors trying to LOCK the bus. However, since PentiumPro, for normal Writeback cacheable memory (all memory an app deals with, unless you talk directly with hardware), instead of blocking all memory operations, only the relevant cache line is blocked (based on the link in @osgx's answer).
i.e. the core delays answering MESI share and RFO requests for the line until after the store part of the actual lock
ed operation. This is called a "cache lock", and only affects that one cache line. Other cores can be loading / storing or even CASing other lines at the same time.
Actually, the CAS case can be more complicated, as explained on this page, with no timings but an insightful description by a trustworthy engineer. (At least for the normal use-case where you do a pure load before the actual CAS.)
Before going into too much detail, I'll say that a LOCKed operation costs one cache miss + the possible contention with other processor on the same cacheline, while CAS + the preceding load (which is almost always required except on mutexes, where you always CAS 0 and 1) can cost two cache misses.
He explains that a load + CAS on a single location can actually cost two cache misses, like Load-Linked/Store-Conditional (see there for the latter). His explaination relies on knowledge of the MESI cache coherence protocol. It uses 4 states for a cacheline: M(odified), E(xclusive), S(hared), I(nvalid) (and therefore it's called MESI), explained below where needed. The scenario, explained, is the following:
In all cases, a cacheline request can be stalled by other processors already modifying the data.