Intel x86 vs x64 system call

becks · Mar 2, 2013 · Viewed 11.6k times

I'm reading about the difference in assembly between x86 and x64.

On x86, the system call number is placed in eax, then int 80h is executed to generate a software interrupt.

But on x64, the system call number is placed in rax, then syscall is executed.

I'm told that syscall is lighter and faster than generating a software interrupt.

Why is it faster on x64 than on x86, and can I make a system call on x64 using int 80h?

Answer

mikyra · Mar 2, 2013

General part

EDIT: Linux irrelevant parts removed

While not totally wrong, narrowing the question down to int 0x80 and syscall oversimplifies it: with sysenter there is at least a third option.

Using int 0x80 with eax for the system call number and ebx, ecx, edx, esi, edi, and ebp for the parameters is just one of many possible ways to implement a system call; those registers are simply the ones the 32-bit Linux ABI chose.

Before taking a closer look at the techniques involved, it should be stated that they all circle around the problem of escaping the privilege prison every process runs in.

Besides the mechanisms presented here, the x86 architecture also offers the call gate as a way out (see: http://en.wikipedia.org/wiki/Call_gate).

The only other possibility present on all i386 machines is using a software interrupt, which allows the ISR (Interrupt Service Routine or simply an interrupt handler) to run at a different privilege level than before.

(Fun fact: some i386 OSes have used an invalid-instruction exception to enter the kernel for system calls, because that was actually faster than an int instruction on 386 CPUs. See OsDev syscall/sysret and sysenter/sysexit instructions enabling for a summary of possible system-call mechanisms.)

Software Interrupt

What exactly happens once an interrupt is triggered depends on whether switching to the ISR requires a privilege change or not:

(Intel® 64 and IA-32 Architectures Software Developer’s Manual)

6.4.1 Call and Return Operation for Interrupt or Exception Handling Procedures

...

If the code segment for the handler procedure has the same privilege level as the currently executing program or task, the handler procedure uses the current stack; if the handler executes at a more privileged level, the processor switches to the stack for the handler’s privilege level.

....

If a stack switch does occur, the processor does the following:

  1. Temporarily saves (internally) the current contents of the SS, ESP, EFLAGS, CS, and EIP registers.

  2. Loads the segment selector and stack pointer for the new stack (that is, the stack for the privilege level being called) from the TSS into the SS and ESP registers and switches to the new stack.

  3. Pushes the temporarily saved SS, ESP, EFLAGS, CS, and EIP values for the interrupted procedure’s stack onto the new stack.

  4. Pushes an error code on the new stack (if appropriate).

  5. Loads the segment selector for the new code segment and the new instruction pointer (from the interrupt gate or trap gate) into the CS and EIP registers, respectively.

  6. If the call is through an interrupt gate, clears the IF flag in the EFLAGS register.

  7. Begins execution of the handler procedure at the new privilege level.

... sigh, this seems to be a lot to do, and even once we're done it doesn't get much better:

(excerpt taken from the same source as mentioned above: Intel® 64 and IA-32 Architectures Software Developer’s Manual)

When executing a return from an interrupt or exception handler from a different privilege level than the interrupted procedure, the processor performs these actions:

  1. Performs a privilege check.

  2. Restores the CS and EIP registers to their values prior to the interrupt or exception.

  3. Restores the EFLAGS register.

  4. Restores the SS and ESP registers to their values prior to the interrupt or exception, resulting in a stack switch back to the stack of the interrupted procedure.

  5. Resumes execution of the interrupted procedure.
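To make the two quoted sequences a bit more concrete, here is a toy model in C of what the processor does for a privilege-changing software interrupt and for the matching return. It is only a sketch: the struct names, the flat word-addressed mem array and the helper functions are made up for illustration, the privilege check is skipped, and no error code is pushed (a software interrupt such as int 0x80 doesn't produce one).

```c
#include <assert.h>
#include <stdint.h>

#define EFLAGS_IF 0x200u            /* interrupt enable flag, bit 9 */
#define MEM_WORDS 256u              /* toy memory: byte address = index * 4 */

/* Illustrative types only; none of these are real kernel or hardware structures. */
struct cpu  { uint32_t eip, esp, eflags; uint16_t cs, ss; };
struct tss  { uint32_t esp0; uint16_t ss0; };   /* stack to use at ring 0 */
struct gate { uint32_t offset; uint16_t selector; int is_interrupt_gate; };

static uint32_t mem[MEM_WORDS];

static void push32(struct cpu *c, uint32_t v)
{
    assert(c->esp >= 4 && c->esp <= MEM_WORDS * 4);
    c->esp -= 4;
    mem[c->esp / 4] = v;
}

static uint32_t pop32(struct cpu *c)
{
    assert(c->esp + 4 <= MEM_WORDS * 4);
    uint32_t v = mem[c->esp / 4];
    c->esp += 4;
    return v;
}

/* Entry with a stack switch (steps 1 to 7 of the quoted list). */
void interrupt_enter(struct cpu *c, const struct tss *t, const struct gate *g)
{
    struct cpu old = *c;             /* 1. save SS, ESP, EFLAGS, CS, EIP */
    c->ss  = t->ss0;                 /* 2. the new stack comes from the TSS */
    c->esp = t->esp0;
    push32(c, old.ss);               /* 3. push the saved values */
    push32(c, old.esp);
    push32(c, old.eflags);
    push32(c, old.cs);
    push32(c, old.eip);
                                     /* 4. no error code for a software interrupt */
    c->cs  = g->selector;            /* 5. CS:EIP come from the gate */
    c->eip = g->offset;
    if (g->is_interrupt_gate)        /* 6. interrupt gates clear IF */
        c->eflags &= ~EFLAGS_IF;
}                                    /* 7. the handler runs from here */

/* Return to the interrupted, less privileged code (privilege check omitted). */
void interrupt_return(struct cpu *c)
{
    c->eip    = pop32(c);
    c->cs     = (uint16_t)pop32(c);
    c->eflags = pop32(c);
    uint32_t esp = pop32(c);         /* SS:ESP are restored last, which   */
    c->ss  = (uint16_t)pop32(c);     /* switches back to the old stack    */
    c->esp = esp;
}
```

Entering from a made-up user context with ESP = 0x200 and a TSS stack top of 0x400 leaves ESP at 0x400 - 20, because five 32-bit values were pushed; interrupt_return then puts the original register values back. All of that memory traffic is what sysenter and syscall were designed to avoid.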

Sysenter

Another option on the 32-bit platform, not mentioned in your question at all but nevertheless used by the Linux kernel, is the sysenter instruction.

(Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 2 (2A, 2B & 2C): Instruction Set Reference, A-Z)

Description Executes a fast call to a level 0 system procedure or routine. SYSENTER is a companion instruction to SYSEXIT. The instruction is optimized to provide the maximum performance for system calls from user code running at privilege level 3 to operating system or executive procedures running at privilege level 0.

One disadvantage of this solution is that it is not present on all 32-bit machines, so the int 0x80 method still has to be provided in case the CPU does not support it.

The SYSENTER and SYSEXIT instructions were introduced into the IA-32 architecture in the Pentium II processor. The availability of these instructions on a processor is indicated with the SYSENTER/SYSEXIT present (SEP) feature flag returned to the EDX register by the CPUID instruction. An operating system that qualifies the SEP flag must also qualify the processor family and model to ensure that the SYSENTER/SYSEXIT instructions are actually present.
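The check Intel describes can be sketched in C roughly as follows. This is a minimal sketch, assuming a GCC or Clang toolchain that ships <cpuid.h>; the names sep_usable and cpu_has_sysenter are made up for this example. SEP is bit 11 of EDX for CPUID leaf 1, and the family/model qualification is needed because early Pentium Pro parts (family 6, model below 3, stepping below 3) report the flag without implementing the instructions.

```c
#include <stdint.h>
#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h>                  /* GCC/Clang helper: __get_cpuid() */
#endif

#define CPUID_EDX_SEP (1u << 11)    /* SYSENTER/SYSEXIT present */

/* Decide from the CPUID leaf 1 values: EAX = signature, EDX = feature flags. */
int sep_usable(uint32_t eax, uint32_t edx)
{
    uint32_t stepping = eax & 0xf;
    uint32_t family   = (eax >> 8) & 0xf;
    /* For family 6 the extended model bits (19:16) extend the model number. */
    uint32_t model    = (((eax >> 16) & 0xf) << 4) | ((eax >> 4) & 0xf);

    if (!(edx & CPUID_EDX_SEP))
        return 0;
    if (family == 6 && model < 3 && stepping < 3)
        return 0;                   /* flag set, instructions missing */
    return 1;
}

/* Query the CPU this code is running on; returns 0 on non-x86 builds. */
int cpu_has_sysenter(void)
{
#if defined(__i386__) || defined(__x86_64__)
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    return sep_usable(eax, edx);
#else
    return 0;
#endif
}
```

Keeping the decision in sep_usable, separate from the actual CPUID query, makes it possible to try the logic with hand-written signature values.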

Syscall

The last possibility, the syscall instruction, pretty much allows for the same functionality as the sysenter instruction. The existence of both is due to the fact that one (sysenter) was introduced by Intel while the other (syscall) was introduced by AMD.

Linux specific

In the Linux kernel any of the three possibilities mentioned above may be chosen to realize a system call.

See also The Definitive Guide to Linux System Calls.

As already stated above, the int 0x80 method is the only one of the three that can run on any i386 CPU, so it is the only one that is always available for 32-bit user-space.

(syscall is the only one that's always available for 64-bit user-space, and the only one you should ever use in 64-bit code; x86-64 kernels can be built without CONFIG_IA32_EMULATION, and int 0x80 still invokes the 32-bit ABI which truncates pointers to 32-bit.)
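For 64-bit code, a system call through the syscall instruction might look like the sketch below. It assumes Linux on x86-64 with GCC or Clang inline assembly; raw_syscall3 and raw_write are names invented for this example. The number goes in rax and the first three arguments in rdi, rsi and rdx; the instruction itself overwrites rcx and r11, so both are listed as clobbers. The 64-bit ABI also has its own system call numbers, which is why the code takes SYS_write from <sys/syscall.h> instead of hard-coding the 4 of the 32-bit table.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>            /* SYS_write, SYS_getpid, ... */
#include <unistd.h>

/* Issue a three-argument Linux system call.
 * Returns the raw kernel result; a value in [-4095, -1] is a negated errno. */
long raw_syscall3(long nr, long a1, long a2, long a3)
{
#if defined(__linux__) && defined(__x86_64__)
    long ret;
    __asm__ __volatile__ ("syscall"
                          : "=a" (ret)                 /* result in rax */
                          : "a" (nr), "D" (a1), "S" (a2), "d" (a3)
                          : "rcx", "r11", "memory");   /* trashed by syscall */
    return ret;
#else
    /* Assumption for other targets: fall back to libc's syscall() and
     * convert its -1/errno convention to the raw kernel convention. */
    long ret = syscall(nr, a1, a2, a3);
    return ret == -1 ? -errno : ret;
#endif
}

/* write(2) expressed through the wrapper above. */
long raw_write(int fd, const void *buf, unsigned long len)
{
    return raw_syscall3(SYS_write, fd, (long)buf, (long)len);
}
```

Calling raw_write (1, "2", 1) would print a single character, much like the 32-bit examples further down do; an invalid descriptor comes back as -EBADF rather than as -1 with errno set, because no libc wrapper is involved.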

To allow switching between all three choices, every process is given access to a special shared object that provides the system call implementation chosen for the running system. This is the strange-looking linux-gate.so.1 you might already have encountered as an unresolved library when using ldd or the like.

(arch/x86/vdso/vdso32-setup.c)

    if (vdso32_syscall()) {
        vsyscall = &vdso32_syscall_start;
        vsyscall_len = &vdso32_syscall_end - &vdso32_syscall_start;
    } else if (vdso32_sysenter()) {
        vsyscall = &vdso32_sysenter_start;
        vsyscall_len = &vdso32_sysenter_end - &vdso32_sysenter_start;
    } else {
        vsyscall = &vdso32_int80_start;
        vsyscall_len = &vdso32_int80_end - &vdso32_int80_start;
    }

To use it, all you have to do is load your registers (system call number in eax, parameters in ebx, ecx, edx, esi, edi) exactly as with the int 0x80 implementation, and then call the main routine.

Unfortunately, it is not quite that easy: to minimize the security risk of a fixed, predefined address, the location at which the vdso (virtual dynamic shared object) is visible in a process is randomized, so you have to figure out the correct location first.

This address is individual to each process and is passed to the process when it is started.

In case you didn't know: when a process is started on Linux, its stack holds pointers to the parameters it was started with, followed by pointers to the environment variables it runs under, each of the two lists terminated by NULL.

In addition to these, a third block of so-called ELF auxiliary vectors follows the two mentioned before. The correct location is encoded in the one carrying the type identifier AT_SYSINFO.

So the stack layout looks like this (addresses increase going down the list):

  • parameter-0
  • ...
  • parameter-m
  • NULL
  • environment-0
  • ....
  • environment-n
  • NULL
  • ...
  • auxiliary elf vector: AT_SYSINFO
  • ...
  • auxiliary elf vector: AT_NULL

Usage example

To find the correct address, you first have to skip all argument and environment pointers and then scan for AT_SYSINFO, as shown in the example below:

#include <stdio.h>
#include <elf.h>

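/* write(1, &c, 1) via int 0x80; 32-bit code, so build with -m32 on x86-64 */
void putc_1 (char c) {
  __asm__ __volatile__ (
           "movl $0x04, %%eax\n"   /* eax = 4: __NR_write */
           "movl $0x01, %%ebx\n"   /* ebx = 1: stdout */
           "movl $0x01, %%edx\n"   /* edx = 1: length */
           "int $0x80"
           :: "c" (&c)             /* ecx = address of the character */
           : "eax", "ebx", "edx", "memory"); /* the kernel reads c via ecx */
}

/* the same call, entered through the address found in the aux vector */
void putc_2 (char c, void *addr) {
  __asm__ __volatile__ (
           "movl $0x04, %%eax\n"
           "movl $0x01, %%ebx\n"
           "movl $0x01, %%edx\n"
           "call *%%esi"           /* esi = AT_SYSINFO entry point */
           :: "c" (&c), "S" (addr)
           : "eax", "ebx", "edx", "memory");
}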


int main (int argc, char *argv[]) {

  /* using int 0x80 */
  putc_1 ('1');


  /* rather nasty search for jump address */
  argv += argc + 1;     /* skip args */
  while (*argv != NULL) /* skip env */
    ++argv;            

  Elf32_auxv_t *aux = (Elf32_auxv_t*) ++argv; /* aux vector start */

  while (aux->a_type != AT_SYSINFO) {
    if (aux->a_type == AT_NULL)
      return 1;
    ++aux;
  }

  putc_2 ('2', (void*) aux->a_un.a_val);

  return 0;
}

As you will see by taking a look at the following snippet of /usr/include/asm/unistd_32.h on my system:

#define __NR_restart_syscall 0
#define __NR_exit            1
#define __NR_fork            2
#define __NR_read            3
#define __NR_write           4
#define __NR_open            5
#define __NR_close           6

The syscall I used is the one numbered 4 (write), as passed in the eax register. It takes a file descriptor (ebx = 1), a data pointer (ecx = &c) and a size (edx = 1) as its arguments, each passed in the corresponding register.
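As an aside, the rather nasty search through the stack in main can be avoided on systems whose C library offers getauxval (glibc has it since version 2.16). The following is a minimal sketch under that assumption; vsyscall_entry and vdso_base are names made up for this example.

```c
#include <elf.h>        /* AT_SYSINFO, AT_SYSINFO_EHDR */
#include <sys/auxv.h>   /* getauxval() */

/* Address of the kernel's system call entry stub, or 0 if the kernel
 * passed no AT_SYSINFO entry (as is the case for 64-bit x86 processes). */
unsigned long vsyscall_entry(void)
{
    return getauxval(AT_SYSINFO);
}

/* Base address of the vDSO ELF image, or 0 if there is none. */
unsigned long vdso_base(void)
{
    return getauxval(AT_SYSINFO_EHDR);
}
```

In the 32-bit program above, the two search loops could then be replaced by a call like putc_2 ('2', (void*) vsyscall_entry ()), after checking that the returned address is not 0.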

To put a long story short

Comparing a supposedly slow int 0x80 system call on any Intel CPU with a (hopefully) much faster implementation using the syscall instruction (genuinely invented by AMD) is comparing apples to oranges.

IMHO, the sysenter instruction rather than int 0x80 is most probably what should be put to the test here.