I am trying to create a dumb version of a spin lock. Browsing the web, I came across an x86 assembly instruction called "PAUSE", which is used to give the processor a hint that a spin-wait loop is currently running on this CPU. The Intel manual and other available information state that:
"The processor uses this hint to avoid the memory order violation in most situations, which greatly improves processor performance. For this reason, it is recommended that a PAUSE instruction be placed in all spin-wait loops." The documentation also mentions that "wait(some delay)" is the pseudo-implementation of the instruction.
The last line of the above paragraph is intuitive. If I am unsuccessful in grabbing the lock, I must wait for some time before grabbing the lock again.
However, what do we mean by a memory order violation in the case of a spin lock? Does "memory order violation" mean an incorrect speculative load/store by the instructions that follow the spin lock?
The spin-lock question has been asked on Stack Overflow before, but the memory order violation question remains unanswered (at least as far as I understand it).
Just imagine how the processor would execute a typical spin-wait loop:
1 Spin_Lock:
2 CMP lockvar, 0 ; Check if lock is free
3 JE Get_Lock
4 JMP Spin_Lock
5 Get_Lock:
After a few iterations the branch predictor will predict that the conditional branch (3) will never be taken, and the pipeline will fill with CMP instructions (2). This goes on until another processor finally writes a zero to lockvar. At this point the pipeline is full of speculative (i.e. not yet committed) CMP instructions, some of which have already read lockvar and reported an (incorrect) nonzero result to the following conditional branch (3), which is also speculative. This is when the memory order violation happens. Whenever the processor "sees" an external write (a write from another processor), it searches its pipeline for instructions that speculatively accessed the same memory location and have not yet committed. If any such instructions are found, the speculative state of the processor is invalid and is erased with a pipeline flush.
Unfortunately this scenario will (very likely) repeat each time a processor is waiting on a spin-lock and make these locks much slower than they ought to be.
Enter the PAUSE instruction:
1 Spin_Lock:
2 CMP lockvar, 0 ; Check if lock is free
3 JE Get_Lock
4 PAUSE ; Wait for memory pipeline to become empty
5 JMP Spin_Lock
6 Get_Lock:
The PAUSE instruction will "de-pipeline" the memory reads, so that the pipeline is not filled with speculative CMP (2) instructions as in the first example. (That is, it could block the pipeline until all older memory instructions are committed.) Because the CMP instructions (2) then execute sequentially, it is unlikely (i.e. the time window is much shorter) that an external write occurs after a CMP instruction (2) has read lockvar but before that CMP is committed.
Of course, "de-pipelining" will also waste less energy in the spin-lock, and with hyperthreading it will not waste resources that the other thread could put to better use. On the other hand, there is still a branch misprediction waiting to occur at each loop exit. Intel's documentation does not suggest that PAUSE eliminates that pipeline flush, but who knows...
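If you are writing the lock in C rather than assembly, the same spin-wait pattern might look roughly like the sketch below. It is only an illustration under a few assumptions: the names demo_spinlock, spin_init, spin_trylock, spin_lock, spin_unlock and cpu_relax are made up for this example, the acquire/release memory orders are my own choice, and the 0 = free convention is borrowed from lockvar above. The _mm_pause() intrinsic from <immintrin.h> emits PAUSE on x86; on other targets the sketch falls back to a no-op.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>              /* _mm_pause() emits the PAUSE instruction */
#define cpu_relax() _mm_pause()
#else
#define cpu_relax() ((void)0)       /* assumption: plain no-op on non-x86 targets */
#endif

/* Hypothetical lock type: 0 = free, 1 = held (same convention as lockvar). */
typedef struct { atomic_int lockvar; } demo_spinlock;

static void spin_init(demo_spinlock *l) { atomic_init(&l->lockvar, 0); }

/* One attempt to take the lock; returns true if we got it. */
static bool spin_trylock(demo_spinlock *l) {
    return atomic_exchange_explicit(&l->lockvar, 1, memory_order_acquire) == 0;
}

static void spin_lock(demo_spinlock *l) {
    for (;;) {
        if (spin_trylock(l))
            return;
        /* Spin-wait on plain loads (the CMP/JE/JMP loop above),
           with a PAUSE in every iteration. */
        while (atomic_load_explicit(&l->lockvar, memory_order_relaxed) != 0)
            cpu_relax();
    }
}

static void spin_unlock(demo_spinlock *l) {
    /* Debug check: unlocking a lock that is not held is a caller bug. */
    assert(atomic_load_explicit(&l->lockvar, memory_order_relaxed) == 1);
    atomic_store_explicit(&l->lockvar, 0, memory_order_release);
}
```

A caller would run spin_init once, then wrap each critical section in spin_lock / spin_unlock. The inner loop only reads lockvar, and the atomic exchange is retried only after the lock has been seen free, so the waiting part matches the read-only loop discussed above.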