What Every Programmer Should Know About Memory?

Framester picture Framester · Nov 14, 2011 · Viewed 26.5k times · Source

I am wondering how much of Ulrich Drepper's What Every Programmer Should Know About Memory from 2007 is still valid. Also I could not find a newer version than 1.0 or an errata.

(Also in PDF form on Ulrich Drepper's own site: https://www.akkadia.org/drepper/cpumemory.pdf)

Answer

Peter Cordes picture Peter Cordes · Dec 8, 2017

The guide in PDF form is at https://www.akkadia.org/drepper/cpumemory.pdf.

It is still generally excellent and highly recommended (by me, and I think by other performance-tuning experts). It would be cool if Ulrich (or anyone else) wrote a 2017 update, but that would be a lot of work (e.g. re-running the benchmarks). See also other x86 performance-tuning and SSE/asm (and C/C++) optimization links in the tag wiki. (Ulrich's article isn't x86 specific, but most (all) of his benchmarks are on x86 hardware.)

The low level hardware details about how DRAM and caches work all still apply. DDR4 uses the same commands as described for DDR1/DDR2 (read/write burst). The DDR3/4 improvements aren't fundamental changes. AFAIK, all the arch-independent stuff still applies generally, e.g. to AArch64 / ARM32.

See also the Latency Bound Platforms section of this answer for important details about the effect of memory/L3 latency on single-threaded bandwidth: bandwidth <= max_concurrency / latency, and this is actually the primary bottleneck for single-threaded bandwidth on a modern many-core CPU like a Xeon. But a quad-core Skylake desktop can come close to maxing out DRAM bandwidth with a single thread. That link has some very good info about NT stores vs. normal stores on x86. Why is Skylake so much better than Broadwell-E for single-threaded memory throughput? is a summary.

Thus Ulrich's suggestion in 6.5.8 Utilizing All Bandwidth about using remote memory on other NUMA nodes as well as your own, is counter-productive on modern hardware where memory controllers have more bandwidth than a single core can use. Well possibly you can imagine a situation where there's a net benefit to running multiple memory-hungry threads on the same NUMA node for low-latency inter-thread communication, but having them use remote memory for high bandwidth not-latency-sensitive stuff. But this is pretty obscure, normally just divide threads between NUMA nodes and have them use local memory. Per-core bandwidth is sensitive to latency because of max-concurrency limits (see below), but all the cores in one socket can usually more than saturate the memory controllers in that socket.


(usually) Don't use software prefetch

One major thing that's changed is that hardware prefetch is much better than on the Pentium 4 and can recognize strided access patterns up to a fairly large stride, and multiple streams at once (e.g. one forward / backward per 4k page). Intel's optimization manual describes some details of the HW prefetchers in various levels of cache for their Sandybridge-family microarchitecture. Ivybridge and later have next-page hardware prefetch, instead of waiting for a cache miss in the new page to trigger a fast-start. I assume AMD has some similar stuff in their optimization manual. Beware that Intel's manual is also full of old advice, some of which is only good for P4. The Sandybridge-specific sections are of course accurate for SnB, but e.g. un-lamination of micro-fused uops changed in HSW and the manual doesn't mention it.

The usual advice these days is to remove all SW prefetch from old code, and only consider putting it back in if profiling shows cache misses (and you're not saturating memory bandwidth). Prefetching both sides of the next step of a binary search can still help. e.g. once you decide which element to look at next, prefetch the 1/4 and 3/4 elements so they can load in parallel with loading/checking middle.

The suggestion to use a separate prefetch thread (6.3.4) is totally obsolete, I think, and was only ever good on Pentium 4. P4 had hyperthreading (2 logical cores sharing one physical core), but not enough trace-cache (and/or out-of-order execution resources) to gain throughput running two full computation threads on the same core. But modern CPUs (Sandybridge-family and Ryzen) are much beefier and should either run a real thread or not use hyperthreading (leave the other logical core idle so the solo thread has the full resources instead of partitioning the ROB).

Software prefetch has always been "brittle": the right magic tuning numbers to get a speedup depend on the details of the hardware, and maybe system load. Too early and it's evicted before the demand load. Too late and it doesn't help. This blog article shows code + graphs for an interesting experiment in using SW prefetch on Haswell for prefetching the non-sequential part of a problem. See also How to properly use prefetch instructions?. NT prefetch is interesting, but even more brittle because an early eviction from L1 means you have to go all the way to L3 or DRAM, not just L2. If you need every last drop of performance, and you can tune for a specific machine, SW prefetch is worth looking at for sequential access, but it may still be a slowdown if you have enough ALU work to do while coming close to bottlenecking on memory.


Cache line size is still 64 bytes. (L1D read/write bandwidth is very high, and modern CPUs can do 2 vector loads per clock + 1 vector store if it all hits in L1D. See How can cache be that fast?.) With AVX512, line size = vector width, so you can load/store an entire cache line in one instruction. Thus every misaligned load/store crosses a cache-line boundary, instead of every other for 256b AVX1/AVX2, which often doesn't slow down looping over an array that wasn't in L1D.

Unaligned load instructions have zero penalty if the address is aligned at runtime, but compilers (especially gcc) make better code when autovectorizing if they know about any alignment guarantees. Actually unaligned ops are generally fast, but page-splits still hurt (much less on Skylake, though; only ~11 extra cycles latency vs. 100, but still a throughput penalty).


As Ulrich predicted, every multi-socket system is NUMA these days: integrated memory controllers are standard, i.e. there is no external Northbridge. But SMP no longer means multi-socket, because multi-core CPUs are widespread. Intel CPUs from Nehalem to Skylake have used a large inclusive L3 cache as a backstop for coherency between cores. AMD CPUs are different, but I'm not as clear on the details.

Skylake-X (AVX512) no longer has an inclusive L3, but I think there's still a tag directory that lets it check what's cached anywhere on chip (and if so where) without actually broadcasting snoops to all the cores. SKX uses a mesh rather than a ring bus, with generally even worse latency than previous many-core Xeons, unfortunately.

Basically all of the advice about optimizing memory placement still applies, just the details of exactly what happens when you can't avoid cache misses or contention vary.


6.4.2 Atomic ops: the benchmark showing a CAS-retry loop as 4x worse than hardware-arbitrated lock add does probably still reflect a maximum contention case. But in real multi-threaded programs, synchronization is kept to a minimum (because it's expensive), so contention is low and a CAS-retry loop usually succeeds without having to retry.

C++11 std::atomic fetch_add will compile to a lock add (or lock xadd if the return value is used), but an algorithm using CAS to do something that can't be done with a locked instruction is usually not a disaster. Use C++11 std::atomic or C11 stdatomic instead of gcc legacy __sync built-ins or the newer __atomic built-ins unless you want to mix atomic and non-atomic access to the same location...

8.1 DWCAS (cmpxchg16b): You can coax gcc into emitting it, but if you want efficient loads of just one half of the object, you need ugly union hacks: How can I implement ABA counter with c++11 CAS?. (Don't confuse DWCAS with DCAS of 2 separate memory locations. Lock-free atomic emulation of DCAS isn't possible with DWCAS, but transactional memory (like x86 TSX) makes it possible.)

8.2.4 transactional memory: After a couple false starts (released then disabled by a microcode update because of a rarely-triggered bug), Intel has working transactional memory in late-model Broadwell and all Skylake CPUs. The design is still what David Kanter described for Haswell. There's a lock-ellision way to use it to speed up code that uses (and can fall back to) a regular lock (especially with a single lock for all elements of a container so multiple threads in the same critical section often don't collide), or to write code that knows about transactions directly.


7.5 Hugepages: anonymous transparent hugepages work well on Linux without having to manually use hugetlbfs. Make allocations >= 2MiB with 2MiB alignment (e.g. posix_memalign, or an aligned_alloc that doesn't enforce the stupid ISO C++17 requirement to fail when size % alignment != 0).

A 2MiB-aligned anonymous allocation will use hugepages by default. Some workloads (e.g. that keep using large allocations for a while after making them) may benefit from
echo always >/sys/kernel/mm/transparent_hugepage/defrag to get the kernel to defrag physical memory whenever needed, instead of falling back to 4k pages. (See the kernel docs). Alternatively, use madvise(MADV_HUGEPAGE) after making large allocations (preferably still with 2MiB alignment).


Appendix B: Oprofile: Linux perf has mostly superseded oprofile. For detailed events specific to certain microarchitectures, use the ocperf.py wrapper. e.g.

ocperf.py stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,\
branches,branch-misses,instructions,uops_issued.any,\
uops_executed.thread,idq_uops_not_delivered.core -r2 ./a.out

For some examples of using it, see Can x86's MOV really be "free"? Why can't I reproduce this at all?.