How does one write code that best utilizes the CPU cache to improve performance?

goldenmean picture goldenmean · Apr 18, 2009 · Viewed 66.4k times · Source

This could sound like a subjective question, but what I am looking for are specific instances, which you could have encountered related to this.

  1. How to make code, cache effective/cache friendly (more cache hits, as few cache misses as possible)? From both perspectives, data cache & program cache (instruction cache), i.e. what things in one's code, related to data structures and code constructs, should one take care of to make it cache effective.

  2. Are there any particular data structures one must use/avoid, or is there a particular way of accessing the members of that structure etc... to make code cache effective.

  3. Are there any program constructs (if, for, switch, break, goto,...), code-flow (for inside an if, if inside a for, etc ...) one should follow/avoid in this matter?

I am looking forward to hearing individual experiences related to making cache efficient code in general. It can be any programming language (C, C++, Assembly, ...), any hardware target (ARM, Intel, PowerPC, ...), any OS (Windows, Linux,S ymbian, ...), etc..

The variety will help to better to understand it deeply.

Answer

Mats N picture Mats N · Jun 19, 2009

The cache is there to reduce the number of times the CPU would stall waiting for a memory request to be fulfilled (avoiding the memory latency), and as a second effect, possibly to reduce the overall amount of data that needs to be transfered (preserving memory bandwidth).

Techniques for avoiding suffering from memory fetch latency is typically the first thing to consider, and sometimes helps a long way. The limited memory bandwidth is also a limiting factor, particularly for multicores and multithreaded applications where many threads wants to use the memory bus. A different set of techniques help addressing the latter issue.

Improving spatial locality means that you ensure that each cache line is used in full once it has been mapped to a cache. When we have looked at various standard benchmarks, we have seen that a surprising large fraction of those fail to use 100% of the fetched cache lines before the cache lines are evicted.

Improving cache line utilization helps in three respects:

  • It tends to fit more useful data in the cache, essentially increasing the effective cache size.
  • It tends to fit more useful data in the same cache line, increasing the likelyhood that requested data can be found in the cache.
  • It reduces the memory bandwidth requirements, as there will be fewer fetches.

Common techniques are:

  • Use smaller data types
  • Organize your data to avoid alignment holes (sorting your struct members by decreasing size is one way)
  • Beware of the standard dynamic memory allocator, which may introduce holes and spread your data around in memory as it warms up.
  • Make sure all adjacent data is actually used in the hot loops. Otherwise, consider breaking up data structures into hot and cold components, so that the hot loops use hot data.
  • avoid algorithms and datastructures that exhibit irregular access patterns, and favor linear datastructures.

We should also note that there are other ways to hide memory latency than using caches.

Modern CPU:s often have one or more hardware prefetchers. They train on the misses in a cache and try to spot regularities. For instance, after a few misses to subsequent cache lines, the hw prefetcher will start fetching cache lines into the cache, anticipating the application's needs. If you have a regular access pattern, the hardware prefetcher is usually doing a very good job. And if your program doesn't display regular access patterns, you may improve things by adding prefetch instructions yourself.

Regrouping instructions in such a way that those that always miss in the cache occur close to each other, the CPU can sometimes overlap these fetches so that the application only sustain one latency hit (Memory level parallelism).

To reduce the overall memory bus pressure, you have to start addressing what is called temporal locality. This means that you have to reuse data while it still hasn't been evicted from the cache.

Merging loops that touch the same data (loop fusion), and employing rewriting techniques known as tiling or blocking all strive to avoid those extra memory fetches.

While there are some rules of thumb for this rewrite exercise, you typically have to carefully consider loop carried data dependencies, to ensure that you don't affect the semantics of the program.

These things are what really pays off in the multicore world, where you typically wont see much of throughput improvements after adding the second thread.