AMD has an ABI specification that describes the calling convention to use on x86-64. All OSes follow it, except for Windows which has it's own x86-64 calling convention. Why?
Does anyone know the technical, historical, or political reasons for this difference, or is it purely a matter of NIHsyndrome?
I understand that different OSes may have different needs for higher level things, but that doesn't explain why for example the register parameter passing order on Windows is rcx - rdx - r8 - r9 - rest on stack
while everyone else uses rdi - rsi - rdx - rcx - r8 - r9 - rest on stack
.
P.S. I am aware of how these calling conventions differ generally and I know where to find details if I need to. What I want to know is why.
Edit: for the how, see e.g. the wikipedia entry and links from there.
One of the things to keep in mind about x86 is that the register name to "reg number" encoding is not obvious; in terms of instruction encoding (the MOD R/M byte, see http://www.c-jump.com/CIS77/CPU/x86/X77_0060_mod_reg_r_m_byte.htm), register numbers 0...7 are - in that order - ?AX
, ?CX
, ?DX
, ?BX
, ?SP
, ?BP
, ?SI
, ?DI
.
Hence choosing A/C/D (regs 0..2) for return value and the first two arguments (which is the "classical" 32bit __fastcall
convention) is a logical choice. As far as going to 64bit is concerned, the "higher" regs are ordered, and both Microsoft and UN*X/Linux went for R8
/ R9
as the first ones.
Keeping that in mind, Microsoft's choice of RAX
(return value) and RCX
, RDX
, R8
, R9
(arg[0..3]) are an understandable selection if you choose four registers for arguments.
I don't know why the AMD64 UN*X ABI chose RDX
before RCX
.
UN*X, on RISC architectures, has traditionally done argument passing in registers - specifically, for the first six arguments (that's so on PPC, SPARC, MIPS at least). Which might be one of the major reasons why the AMD64 (UN*X) ABI designers chose to use six registers on that architecture as well.
So if you want six registers to pass arguments in, and it's logical to choose RCX
, RDX
, R8
and R9
for four of them, which other two should you pick ?
The "higher" regs require an additional instruction prefix byte to select them and therefore have a bigger instruction size footprint, so you wouldn't want to choose any of those if you have options. Of the classical registers, due to the implicit meaning of RBP
and RSP
these aren't available, and RBX
traditionally has a special use on UN*X (global offset table) which seemingly the AMD64 ABI designers didn't want to needlessly become incompatible with.
Ergo, the only choice were RSI
/ RDI
.
So if you have to take RSI
/ RDI
as argument registers, which arguments should they be ?
Making them arg[0]
and arg[1]
has some advantages. See cHao's comment.
?SI
and ?DI
are string instruction source / destination operands, and as cHao mentioned, their use as argument registers means that with the AMD64 UN*X calling conventions, the simplest possible strcpy()
function, for example, only consists of the two CPU instructions repz movsb; ret
because the source/target addresses have been put into the correct registers by the caller. There is, particularly in low-level and compiler-generated "glue" code (think, for example, some C++ heap allocators zero-filling objects on construction, or the kernel zero-filling heap pages on sbrk()
, or copy-on-write pagefaults) an enormous amount of block copy/fill, hence it'll be useful for code so frequently used to save the two or three CPU instructions that'd otherwise load such source/target address arguments into the "correct" registers.
So in a way, UN*X and Win64 are only different in that UN*X "prepends" two additional arguments, in purposefully chosen RSI
/RDI
registers, to the natural choice of four arguments in RCX
, RDX
, R8
and R9
.
There are more differences between the UN*X and Windows x64 ABIs than just the mapping of arguments to specific registers. For the overview on Win64, check:
http://msdn.microsoft.com/en-us/library/7kcdt6fy.aspx
Win64 and AMD64 UN*X also strikingly differ in the way stackspace is used; on Win64, for example, the caller must allocate stackspace for function arguments even though args 0...3 are passed in registers. On UN*X on the other hand, a leaf function (i.e. one that doesn't call other functions) is not even required to allocate stackspace at all if it needs no more than 128 Bytes of it (yes, you own and can use a certain amount of stack without allocating it ... well, unless you're kernel code, a source of nifty bugs). All these are particular optimization choices, most of the rationale for those is explained in the full ABI references that the original poster's wikipedia reference points to.