As is widely advertised, modern x86_64 processors have 64-bit registers that can be used in a backward-compatible fashion as 32-bit, 16-bit and even 8-bit registers, for example:
0x1122334455667788
  ================ rax (64 bits)
          ======== eax (32 bits)
              ==== ax (16 bits)
              ==   ah (8 bits)
                == al (8 bits)
This scheme can be taken literally: one might expect that a designated name always accesses only that part of the register, for reading or writing, which would be highly logical. In fact, this holds for everything up to 32 bits:
mov eax, 0x11112222 ; eax = 0x11112222
mov ax, 0x3333 ; eax = 0x11113333 (works, only low 16 bits changed)
mov al, 0x44 ; eax = 0x11113344 (works, only low 8 bits changed)
mov ah, 0x55 ; eax = 0x11115544 (works, only high 8 bits changed)
xor ah, ah ; eax = 0x11110044 (works, only high 8 bits cleared)
mov eax, 0x11112222 ; eax = 0x11112222
xor al, al ; eax = 0x11112200 (works, only low 8 bits cleared)
mov eax, 0x11112222 ; eax = 0x11112222
xor ax, ax ; eax = 0x11110000 (works, only low 16 bits cleared)
However, things get fairly awkward as soon as we get to 64-bit registers:
mov rax, 0x1111222233334444 ; rax = 0x1111222233334444
mov eax, 0x55556666 ; actual: rax = 0x0000000055556666
; expected: rax = 0x1111222255556666
; upper 32 bits seem to be lost!
mov rax, 0x1111222233334444 ; rax = 0x1111222233334444
mov ax, 0x7777 ; rax = 0x1111222233337777 (works!)
mov rax, 0x1111222233334444 ; rax = 0x1111222233334444
xor eax, eax ; actual: rax = 0x0000000000000000
; expected: rax = 0x1111222200000000
; again, it wiped whole register
Such behavior seems highly ridiculous and illogical to me. It looks like writing anything at all to eax, by any means, wipes the high 32 bits of the rax register.
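The observations above can be summarized as a small model. This is only a sketch in Python of the behavior seen in the listings, not a description of the hardware; the function name `write_subregister` and its parameters are made up for illustration.

```python
MASK64 = (1 << 64) - 1

def write_subregister(rax, value, width, high_byte=False):
    """Return the new 64-bit rax after writing `value` to rax/eax/ax/al/ah.

    Hypothetical helper that models only the behavior observed above.
    """
    if width == 64:
        # mov rax, imm: the whole register is replaced
        return value & MASK64
    if width == 32:
        # mov eax, imm: the result is zero-extended, upper 32 bits are lost
        return value & 0xFFFFFFFF
    if width == 16:
        # mov ax, imm: only the low 16 bits change
        return (rax & ~0xFFFF & MASK64) | (value & 0xFFFF)
    if width == 8:
        # mov al/ah, imm: only the selected byte changes
        shift = 8 if high_byte else 0
        return (rax & ~(0xFF << shift) & MASK64) | ((value & 0xFF) << shift)
    raise ValueError("width must be 8, 16, 32 or 64")

# Reproduces the 64-bit listing above
print(hex(write_subregister(0x1111222233334444, 0x55556666, 32)))
print(hex(write_subregister(0x1111222233334444, 0x7777, 16)))
```

In this model the 32-bit case is the only one that ignores the previous contents of the register, which is exactly the asymmetry I am asking about.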
So, I have 2 questions:
I believe that this awkward behavior must be documented somewhere, but I can't find a detailed explanation (of how exactly the high 32 bits of a 64-bit register get wiped) anywhere. Am I right that writing to eax always wipes the high 32 bits of rax, or is it something more complicated? Does it apply to all 64-bit registers, or are there exceptions?
A strongly related question mentions the same behavior, but, alas, there are again no exact references to documentation.
In other words, I'd like a link to documentation that specifies this behavior.
Is it just me, or does this whole thing seem really weird and illogical (i.e. eax-ax-ah-al and rax-ax-ah-al having one behavior and rax-eax having another)? Maybe I'm missing some vital point about why it was implemented like that.
An explanation of the "why" would be highly appreciated.
The processor model as documented in the Intel/AMD processor manuals is a pretty imperfect model of the real execution engine of a modern core. In particular, the notion of processor registers does not match reality: there is no such thing as a single physical EAX or RAX register.
One primary job of the instruction decoder is to convert the legacy x86/x64 instructions into micro-ops, the instructions of a RISC-like processor. These are small instructions that are easy to execute concurrently and can take advantage of multiple execution sub-units, allowing as many as 6 instructions to execute at the same time.
To make that work, the notion of processor registers is virtualized as well. The instruction decoder allocates a register from a big bank of registers. When the instruction is retired, the value of that dynamically allocated register is written back to whatever register currently holds the value of, say, RAX.
To make that work smoothly and efficiently, allowing many instructions to execute concurrently, it is very important that these operations don't have an interdependency. And the worst kind you can have is a register value that depends on other instructions. The EFLAGS register is notorious for this, since many instructions modify it.
The same problem arises with the way you would like it to work. Big problem: it requires two register values to be merged when the instruction is retired, creating a data dependency that's going to clog up the core. By forcing the upper 32 bits to 0, that dependency instantly disappears, and there is no longer a need to merge. Warp 9 execution speed.
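The merge dependency can be illustrated with a toy renaming model. This is a deliberately simplified sketch in Python, not how any particular core is implemented; `PhysReg`, `rename_write` and the `bank` list are hypothetical names invented for the example, and real hardware handles partial registers in more varied ways.

```python
from dataclasses import dataclass

@dataclass
class PhysReg:
    value: int
    depends_on: tuple  # indices of physical registers that had to be read

def rename_write(bank, current, value, width):
    """Allocate a new physical register for a write to rax, eax, ax or al.

    `current` is the index of the physical register that holds rax right now.
    Returns the index of the newly allocated register.
    """
    mask = (1 << width) - 1
    if width in (32, 64):
        # The whole register is defined by this write (zero-extended for
        # 32 bits), so the old value is never read: no dependency.
        new = PhysReg(value & mask, ())
    else:
        # A 16- or 8-bit write must merge with the old upper bits, so it
        # has to wait until the previous value of rax is available.
        old = bank[current]
        new = PhysReg((old.value & ~mask) | (value & mask), (current,))
    bank.append(new)
    return len(bank) - 1

bank = [PhysReg(0x1111222233334444, ())]
r32 = rename_write(bank, 0, 0x55556666, 32)   # mov eax, 0x55556666
r16 = rename_write(bank, 0, 0x7777, 16)       # mov ax, 0x7777
print(hex(bank[r32].value), bank[r32].depends_on)
print(hex(bank[r16].value), bank[r16].depends_on)
```

In this sketch the 32-bit write has an empty `depends_on`, so it could start immediately, while the 16-bit write carries a dependency on whatever produced the previous value of the register.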