How does CLFLUSH work for an address that is not in cache yet?

Mike · Mar 9, 2016

We are trying to use the Intel CLFLUSH instruction to flush the cache content of a process from userspace on Linux.

We wrote a very simple C program that first accesses a large array and then calls CLFLUSH on the virtual address range of the whole array. We measure the latency it takes CLFLUSH to flush the whole array. The array size is an input to the program, and we vary it from 1MB to 40MB in steps of 2MB.
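A minimal sketch of the measurement loop (simplified from our actual program; the 64-byte line size and the timing helper are assumptions, and _mm_clflush is the intrinsic for CLFLUSH):

    /* Minimal sketch: flush a buffer line by line and time it.
       Compile with: gcc -O2 flush.c
       Assumes a 64-byte cache line; a real program would query CPUID. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <emmintrin.h>          /* _mm_clflush */

    #define CACHE_LINE 64

    static double now_sec(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(int argc, char **argv) {
        if (argc < 2) return 1;
        size_t bytes = (size_t)atoi(argv[1]) * 1024;   /* argv[1] = size in KB */
        char *buf = malloc(bytes);
        if (!buf) return 1;

        /* Touch every line first so the array is (mostly) cached. */
        for (size_t i = 0; i < bytes; i += CACHE_LINE)
            buf[i] = 1;

        double t0 = now_sec();
        for (size_t i = 0; i < bytes; i += CACHE_LINE)
            _mm_clflush(buf + i);
        double t1 = now_sec();

        printf("%zu,%.10f\n", bytes / 1024, t1 - t0);
        return 0;
    }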

In our understanding, CLFLUSH should flush content out of the cache. So we expect the latency of flushing the whole array to first increase linearly with the array size, and then stop increasing once the array is larger than 20MB, which is the size of the LLC on our machine.

However, the experimental result is quite surprising, as shown in the figure: the latency does not stop increasing after the array size exceeds 20MB.

We are wondering whether CLFLUSH could bring an address into the cache before flushing it out, if the address is not cached yet. We also searched the Intel Software Developer's Manual, but didn't find any explanation of what CLFLUSH does when an address is not in the cache.

[Figure: flush latency vs. array size for the Read Only and Read and Write scenarios]

Below is the data we used to draw the figure. The first column is the size of the array in KB, and the second column is the latency of flushing the whole array in seconds.

Any suggestion/advice is more than appreciated.

[Modified]

The previously posted code is unnecessary: CLFLUSH can be done much more simply in userspace, with similar performance, so I deleted the messy code to avoid confusion.

SCENARIO=Read Only
1024,.00158601000000000000
3072,.00299244000000000000
5120,.00464945000000000000
7168,.00630479000000000000
9216,.00796194000000000000
11264,.00961576000000000000
13312,.01126760000000000000
15360,.01300500000000000000
17408,.01480760000000000000
19456,.01696180000000000000
21504,.01968410000000000000
23552,.02300760000000000000
25600,.02634970000000000000
27648,.02990350000000000000
29696,.03403090000000000000
31744,.03749210000000000000
33792,.04092470000000000000
35840,.04438390000000000000
37888,.04780050000000000000
39936,.05163220000000000000

SCENARIO=Read and Write
1024,.00200558000000000000
3072,.00488687000000000000
5120,.00775943000000000000
7168,.01064760000000000000
9216,.01352920000000000000
11264,.01641430000000000000
13312,.01929260000000000000
15360,.02217750000000000000
17408,.02516330000000000000
19456,.02837180000000000000
21504,.03183180000000000000
23552,.03509240000000000000
25600,.03845220000000000000
27648,.04178440000000000000
29696,.04519920000000000000
31744,.04858340000000000000
33792,.05197220000000000000
35840,.05526950000000000000
37888,.05865630000000000000
39936,.06202170000000000000

Answer

Peter Cordes · Mar 13, 2016

This doesn't explain the knee in the read-only graph, but does explain why it doesn't plateau.


I didn't get around to testing locally to look into the difference between the hot- and cold-cache cases, but I did come across a performance number for clflush:

This AIDA64 instruction latency/throughput benchmark repository lists a single-socket Haswell-E CPU (i7-5820K) as having a clflush throughput of one per ~99.08 cycles. It doesn't say whether that's for the same address repeatedly, or what.

So clflush isn't anywhere near free even when it doesn't have to do any work. It's still a microcoded instruction, not heavily optimized because it's usually not a big part of the CPU's workload.
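Testing the hot- vs. cold-cache difference locally could look something like this untested sketch, which times clflush on a cached line and then on the same line once it's already been flushed (the fences keep the timed regions from overlapping):

    /* Untested sketch: clflush cost on a hot line vs. an already-flushed line.
       A real microbenchmark would warm up, loop many times, and average. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <x86intrin.h>          /* __rdtsc, _mm_clflush, _mm_mfence */

    int main(void) {
        char *buf = aligned_alloc(64, 64);
        if (!buf) return 1;
        buf[0] = 1;                 /* line is now cached (and dirty) */

        _mm_mfence();
        unsigned long long t0 = __rdtsc();
        _mm_clflush(buf);           /* hot: line must be evicted */
        _mm_mfence();
        unsigned long long t1 = __rdtsc();
        _mm_clflush(buf);           /* cold: line is no longer cached */
        _mm_mfence();
        unsigned long long t2 = __rdtsc();

        printf("hot: %llu cycles, cold: %llu cycles\n", t1 - t0, t2 - t1);
        return 0;
    }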

Skylake is getting ready for that to change, with support for persistent memory connected to the memory controller. On Skylake (i5-6400T), the measured throughput was:

  • clflush: one per ~66.42 cycles
  • clflushopt: one per ~56.33 cycles

Perhaps clflushopt is more of a win when some of the lines are actually dirty and need writing back, maybe while L3 is busy from other cores doing the same thing. Or maybe they just want to get software using the weakly-ordered version ASAP, before making even bigger improvements to throughput. It's ~15% faster in this case, which is not bad.
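Since clflushopt is weakly ordered, code that needs all the flushes to complete before a later store (e.g. committing to persistent memory) has to fence explicitly. A sketch of what that usage could look like, assuming a CPU with CLFLUSHOPT and compiling with -mclflushopt (flush_range is just an illustrative name):

    /* Sketch: flush a range with weakly-ordered clflushopt, then fence.
       Requires CLFLUSHOPT support; compile with: gcc -O2 -mclflushopt */
    #include <stddef.h>
    #include <immintrin.h>          /* _mm_clflushopt, _mm_sfence */

    #define CACHE_LINE 64

    void flush_range(void *p, size_t len) {
        char *c = (char *)p;
        for (size_t i = 0; i < len; i += CACHE_LINE)
            _mm_clflushopt(c + i);  /* flushes can complete out of order */
        _mm_sfence();               /* order all flushes before later stores */
    }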