Chapter 3 of Computer Systems: A Programmer's Perspective (2nd Edition) mentions that
cltq
is equivalent to movslq %eax, %rax
.
Why did they create a new instruction (cltq
) instead of just using movslq %eax,%rax
? Isn't that redundant?
TL;DR: use cltq
(aka cdqe
) when possible, because it's one byte shorter than the exactly-equivalent movslq %eax, %rax
. That's a very minor advantage (so don't sacrifice anything else to make this happen) but choose eax
if you're going to want to sign-extend it a lot.
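To make the size difference concrete, here is a small C sketch that records the machine-code bytes of the two encodings and compares their lengths. The array and function names are made up for illustration; the bytes are the standard encodings (REX.W + 98 for cltq, and REX.W + 63 /r with a register-direct ModRM byte for movslq %eax, %rax).

```c
#include <assert.h>
#include <stddef.h>

/* cltq / cdqe: REX.W prefix (0x48) + opcode 0x98 */
const unsigned char cltq_bytes[] = { 0x48, 0x98 };

/* movslq %eax, %rax / movsxd rax, eax:
   REX.W (0x48) + opcode 0x63 + ModRM 0xC0 (mod=11, reg=rax, rm=eax) */
const unsigned char movslq_eax_rax_bytes[] = { 0x48, 0x63, 0xC0 };

/* Bytes saved by using cltq when the value already lives in eax. */
size_t bytes_saved(void) {
    return sizeof movslq_eax_rax_bytes - sizeof cltq_bytes;
}

/* Hypothetical self-check, not part of any real tool. */
void check_encoding_sizes(void) {
    assert(sizeof cltq_bytes == 2);
    assert(sizeof movslq_eax_rax_bytes == 3);
    assert(bytes_saved() == 1);
}
```

The saving only applies when both source and destination are the accumulator; any other register pair needs the movslq form anyway.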
This is mostly relevant for compiler-writers (compiling signed-integer loop counters indexing arrays); stuff like sign-extending a loop counter every iteration only happens when compilers don't manage to take advantage of signed overflow being undefined behaviour to avoid it. Human programmers will just decide what's signed vs. unsigned to save instructions.
(sign-extending into a different register with movsx
/ movslq
can avoid lengthening the dependency chain for the 32-bit value, relevant if it's updated in a loop.)
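As a rough illustration of where these sign extensions come from, here is a C sketch (function names are hypothetical). The comments describe what an x86-64 compiler would typically do, not guaranteed output; actual code generation depends on the compiler and optimization level.

```c
#include <stddef.h>
#include <stdint.h>

/* Converting a 32-bit signed value to 64 bits is a sign extension.
   On x86-64 this is typically a movslq, or cltq if the value is in eax. */
int64_t widen(int32_t x) {
    return (int64_t)x;
}

/* Signed 32-bit index: forming the address of a[i] needs i sign-extended
   to 64 bits. Because signed overflow is undefined behaviour, a compiler
   may keep a 64-bit counter or a pointer and avoid re-extending i on
   every iteration. */
int64_t sum_int_index(const int32_t *a, int n) {
    int64_t total = 0;
    for (int i = 0; i < n; i++)
        total += a[i];
    return total;
}

/* Pointer-width index from the start: nothing to extend inside the loop. */
int64_t sum_size_index(const int32_t *a, size_t n) {
    int64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += a[i];
    return total;
}
```

Both loops compute the same sum; the difference is only in how much work the compiler has to do to prove the index can live in a 64-bit register.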
Related: complete run-down on Intel vs. AT&T mnemonics for the different sizes of the instructions that sign-extend within RAX (cltq
), or from EAX into EDX:EAX (cltd
), with the equivalent movsx
/ movs?t?
: What does cltq do in assembly?.
Actually, the 32->64 bit form of MOVSX (called movslq
in AT&T syntax), is the new one, new with AMD64. The Intel-syntax mnemonic is actually MOVSXD. The opcode is 63 /r
(so it's 3 bytes including the necessary REX prefix, vs. 4 bytes for 8->64 or 16->64 MOVSX). AMD repurposed the opcode from ARPL, which doesn't exist in 64-bit mode.
To understand the history, remember that current x86 wasn't designed all at once. First there was 16-bit 8086, with no MOVSX/MOVZX at all, just CBW and CWD. Then 386 added MOVSX/MOVZX (and wider versions of CBW/CWD for sign-extending within eax or into edx). Then AMD extended all of that to 64-bit.
The REX versions of the existing MOVSX opcodes still have an 8 or 16bit source, but sign extend all the way to 64 bits instead of just 32. The operand-size prefix lets you encode movsbw
, aka movsx r16, r/m8
. IDK what happens if you use an operand-size prefix and REX.W at the same time. Or what happens if you use an operand-size prefix with the 16bit source form of MOVSX. Probably it's just an expensive way to encode MOV, like using 63 /r
without a REX prefix (which Intel's instruction-set manual recommends against).
cltq
(aka CDQE) is just the obvious way to extend the existing cwtl
(aka CWDE) with a REX.W prefix to promote the operand-size to 64 bits. The original form of this, cbtw
(aka CBW), was in 8086, predating MOVSX, and was the only sane way to sign-extend anything. Since shifts with immediate count>1 were a 186 feature, the least bad other option seems to be mov ah, al
/ mov cl, 7
/ sar ah, cl
to broadcast the sign bit to all positions.
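The following C sketch models that three-instruction fallback and checks it against what CBW computes. It is only a model of the register contents, with made-up function names; the arithmetic shift is written as repeated one-bit steps so the C code does not depend on how the compiler treats right shifts of negative values.

```c
#include <stdint.h>

/* Arithmetic right shift of an 8-bit value, one bit at a time:
   the top (sign) bit is kept and copied downward on every step. */
uint8_t sar8(uint8_t v, unsigned count) {
    while (count--)
        v = (uint8_t)((v >> 1) | (v & 0x80u));
    return v;
}

/* What cbtw / CBW produces: al sign-extended into ax. */
uint16_t cbw_model(uint8_t al) {
    return (al & 0x80u) ? (uint16_t)(0xFF00u | al) : (uint16_t)al;
}

/* The fallback sequence: mov ah, al / mov cl, 7 / sar ah, cl */
uint16_t shift_idiom_model(uint8_t al) {
    uint8_t ah = al;       /* mov ah, al */
    unsigned cl = 7;       /* mov cl, 7  */
    ah = sar8(ah, cl);     /* sar ah, cl : ah becomes 0x00 or 0xFF */
    return (uint16_t)((ah << 8) | al);
}
```

For every 8-bit input the two functions agree, which is why the sequence works, but it costs three instructions and clobbers cl where CBW needs one byte.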
Also, don't confuse cwtl
with cwtd
(aka CWD: sign extend ax into dx:ax, e.g. to set up for idiv).
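A small C model of the two may help keep them apart (the function names are illustrative only). Both start from the 16-bit value in ax; the difference is where the upper bits end up.

```c
#include <stdint.h>

/* cwtl / CWDE: sign-extend ax within the accumulator.
   The return value is the new contents of eax. */
uint32_t cwde_model(uint16_t ax) {
    return (ax & 0x8000u) ? (0xFFFF0000u | ax) : (uint32_t)ax;
}

/* cwtd / CWD: ax is left unchanged, and dx is filled with
   16 copies of ax's sign bit. The return value is the new dx. */
uint16_t cwd_dx_model(uint16_t ax) {
    return (ax & 0x8000u) ? 0xFFFFu : 0x0000u;
}
```

Concatenating dx:ax after CWD gives the same 32-bit number that CWDE leaves in eax; only the destination differs, which is what matters when setting up for a 16-bit idiv.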
The AT&T mnemonics are pretty horrible here. l
vs. d
, really? The Intel mnemonics all have e
on the end for the ones that extend within rax, and not for the ones that extend into (part of) rdx. Except for CBW, but of course that extends al into ax: even 8086 had 16-bit registers, so it never needed to store 16-bit values in dl:al. idiv r/m8
uses ax as a source reg, not dl:al (and puts the quotient in al and the remainder in ah).
redundancies
Yes, this is one of many redundancies in x86 assembly language. e.g. sub eax,eax
to zero rax vs. xor eax,eax
. (mov eax,0
isn't totally redundant, because it doesn't affect flags. If you count slight differences like that as redundant, or even instructions that run on different execution ports, there are lots of ways to do some things.)
If I had the chance to modify the x86-64 ISA, I would probably give MOVZX and MOVSX single-byte opcodes (instead of 0F XX
two-byte escaped opcodes), at least the 8-bit-source versions. So movsx eax, byte [mem]
would be as compact as mov al, [mem]
. (They're already the same performance on Intel CPUs: handled entirely in the load port, with no ALU uop). Most real code fails to take advantage of [u]int16_t
arrays for higher cache density, so I think movs/zx from word to dword or qword is rarer. Or maybe there's enough wide-character code around to justify shorter opcodes for MOVZX r32/r64, r/m16
. To make some room, we can drop the CBW / CWDE / CDQE opcode entirely. I might keep CWD / CDQ / CQO as a useful setup for idiv, which has no one-instruction equivalent.
In reality, probably having fewer single-byte opcodes and more escape prefixes would be a lot more useful (e.g. so common SSE2 insns can be 2 opcode bytes + ModRM, instead of the usual 3 or 4 opcode bytes). Instruction-decoding is less of a bottleneck with shorter instructions in high-performance loops. But if x86-64 machine code is too different from 32-bit, we need extra decode transistors. That may be ok now that power limitations have made dark silicon a thing, because a core would never need its 32-bit decoder powered on at the same time as its 64-bit decoder. That wasn't the case when AMD was designing AMD64. (err, HyperThreading alternating cycles between logical threads running in 32-bit and 64-bit would stop you from fully shutting down either, if they were separate.)
Instead of CDQ, we could have made shift instructions with a separate destination operand (leaving the source unmodified), so sar edx, eax, 31
would do CDQ in 3 bytes. Dropping the one-byte xchg-with-eax opcodes (other than 0x90 xchg eax,eax
NOP) would free up lots of coding space for sar, shr, shl without needing the Reg field of the ModRM as extra opcode bits. And of course I'd remove the doesn't-affect-flags special case for shift_count=0, to kill the input dependency on FLAGS.
(I'd also have changed setcc r/m8
to setcc r/m32
. Or maybe setcc r32/m8
. (Memory dst uses a separate ALU uop anyway, so it could decode as setcc tmp32 and store the low 8 of that). It's almost always used by xor-zeroing a destination, and you have to juggle that vs. the flag-setting.)
AMD had the chance to do (some of) this with AMD64, but chose to be conservative to share as many instruction-decode transistors as possible. (Can't really fault them for that, but it's unfortunate that political/economic circumstances resulted in x86 missing its only chance for the foreseeable future to drop some of its legacy baggage.) It also meant less work modifying code generation / analysis software, but that's a one-time cost and small potatoes compared to potentially making every x86-64 CPU run faster and have smaller binaries.
See also the x86 tag wiki for more links, including this old appendix from the NASM manual documenting when every form of every instruction was introduced.