LOOP (Intel ref manual entry)
decrements ecx / rcx, and then jumps if non-zero. It's slow, but couldn't Intel have cheaply made it fast? dec/jnz
already macro-fuses into a single uop on Sandybridge-family; the only difference being that that sets flags.
loop
on various microarchitectures, from Agner Fog's instruction tables:
Bulldozer-family/Ryzen: 1 m-op (same cost as macro-fused test-and-branch, or jecxz
)
P4: 4 uops (same as jecxz
)
loope
/ loopne
). Throughput = 4c (loop
) or 7c (loope/ne
).loope
/ loopne
). Throughput = one per 5 cycles, as much of a bottleneck as keeping your loop counter in memory! jecxz
is only 2 uops with same throughput as regular jcc
Couldn't the decoders just decode the same as lea rcx, [rcx-1]
/ jrcxz
? That would be 3 uops. At least that would be the case with no address-size prefix, otherwise it has to use ecx
and truncate RIP
to EIP
if the jump is taken; maybe the odd choice of address-size controlling the width of the decrement explains the many uops?
Or better, just decode it as a fused dec-and-branch that doesn't set flags? dec ecx
/ jnz
on SnB decodes to a single uop (which does set flags).
I know that real code doesn't use it (because it's been slow since at least P5 or something), but AMD decided it was worth it to make it fast for Bulldozer. Probably because it was easy.
Would it be easy for SnB-family uarch to have fast loop
? If so, why don't they? If not, why is it hard? A lot of decoder transistors? Or extra bits in a fused dec&branch uop to record that it doesn't set flags? What could those 7 uops be doing? It's a really simple instruction.
What's special about Bulldozer that made a fast loop
easy / worth it? Or did AMD waste a bunch of transistors on making loop
fast? If so, presumably someone thought it was a good idea.
If loop
was fast, it would be perfect for BigInteger arbitrary-precision adc
loops, to avoid partial-flag stalls / slowdowns (see my comments on my answer), or any other case where you want to loop without touching flags. It also has a minor code-size advantage over dec/jnz
. (And dec/jnz
only macro-fuses on SnB-family).
On modern CPUs where dec/jnz
is ok in an ADC loop, loop
would still be nice for ADCX / ADOX loops (to preserve OF).
If loop
had been fast, compilers would already be using it as a peephole optimization for code-size + speed on CPUs without macro-fusion.
It wouldn't stop me from getting annoyed at all the questions with bad 16bit code that uses loop
for every loop, even when they also need another counter inside the loop. But at least it wouldn't be as bad.
Now that I googled after writing my question, it turns out to be an exact duplicate of one on comp.arch, which came up right away. I expected it to be hard to google (lots of "why is my loop slow" hits), but my first try (why is the x86 loop instruction slow
) got results.
It might be the best we'll get, and will have to suffice unless someone can shed some more light on it. I didn't set out to write this as an answer-my-own-question post.
Good posts with different theories in that thread:
LOOP became slow on some of the earliest machines (circa 486) when significant pipelining started to happen, and running any but the simplest instruction down the pipeline efficiently was technologically impractical. So LOOP was slow for a number of generations. So nobody used it. So when it became possible to speed it up, there was no real incentive to do so, since nobody was actually using it.
IIRC LOOP was used in some software for timing loops; there was (important) software that did not work on CPUs where LOOP was too fast (this was in the early 90s or so). So CPU makers learned to make LOOP slow.
(Paul, and anyone else: You're welcome to re-post your own writing as your own answer. I'll remove it from my answer and up-vote yours.)
@Paul A. Clayton (occasional SO poster and CPU architecture guy) took a guess at how you could use that many uops. (This looks like loope/ne
which checks both the counter and ZF):
I could imagine a possibly sensible 6-µop version:
virtual_cc = cc; temp = test (cc); rCX = rCX - temp; // also setting cc cc = temp & cc; // assumes branch handling is not // substantially changed for the sake of LOOP branch cc = virtual_cc
(Note that this is 6 uops, not SnB's 11 for LOOPE/LOOPNE, and is a total guess not even trying to take into account anything known from SnB perf counters.)
Then Paul said:
I agree that a shorter sequence should be possible, but I was trying to think of a bloated sequence that might make sense if minimal microarchitectural adjustments were permitted.
summary: The designers wanted loop
to be supported only via microcode, with no adjustments whatsoever to the hardware proper.
If a useless, compatibility-only instruction is handed to the microcode developers, they might reasonably not be able or willing to suggest minor changes to the internal microarchitecture to improve such an instruction. Not only would they rather use their "change suggestion capital" more productively but the suggestion of a change for a useless case would reduce the credibility of other suggestions.
(My opinion: Intel is probably still making it slow on purpose, and hasn't bothered to rewrite their microcode for it for a long time. Modern CPUs are probably too fast for anything using loop
in a naive way to work correctly.)
... Paul continues:
The architects behind Nano may have found avoiding the special casing of LOOP simplified their design in terms of area or power. Or they may have had incentives from embedded users to provide a fast implementation (for code density benefits). Those are just WILD guesses.
If optimization of LOOP fell out of other optimizations (like fusion of compare and branch), it might be easier to tweak LOOP into a fast path instruction than to handle it in microcode even if the performance of LOOP was unimportant.
I suspect that such decisions are based on specific details of the implementation. Information about such details does not seem to be generally available and interpreting such information would be beyond the skill level of most people. (I am not a hardware designer--and have never played one on television or stayed at a Holiday Inn Express. :-)
The thread then went off-topic into the realm of AMD blowing our one chance to clean up the cruft in x86 instruction encoding. It's hard to blame them, since every change is a case where the decoders can't share transistors. And before Intel adopted x86-64, it wasn't even clear that it would catch on. AMD didn't want to burden their CPUs with hardware nobody used if AMD64 didn't catch on.
But still, there are so many small things: setcc
could have changed to 32bits. (Usually you have to use xor-zero / test / setcc to avoid false dependencies, or because you need a zero-extended reg). Shift could have unconditionally written flags, even with zero shift count (removing the input data dependency on eflags for variable-count shift for OOO execution). Last time I typed this list of pet peeves, I think there was a third one... Oh yeah, bt
/ bts
etc. with memory operands has the address dependent on the upper bits of the index (bit string, not just bit within a machine word).
bts
instructions are very useful for bit-field stuff, and are slower than they need to be so you almost always want to load into a register and then use that. (It's usually faster to shift/mask to get an address yourself, instead of using 10 uop bts [mem], reg
on Skylake, but it does take extra instructions. So it made sense on 386, but not on K8). Atomic bit-manipulation has to use the memory-dest form, but the lock
ed version needs lots of uops anyway. It's still slower than if it couldn't access outside the dword
it's operating on.