How do system calls work?

xyz · Jun 5, 2011

I understand that a user can own a process and that each process has an address space (the set of valid memory locations the process can reference). I know that a process can invoke a system call and pass parameters to it, just like any other library function. This seems to suggest that all system calls live in the process's address space, through shared memory or something similar. But perhaps this is only an illusion, created by the fact that in a high-level programming language a system call looks like any other function call.

Now let me go a step deeper and look more closely at what happens under the hood. How does the compiler compile a system call? Perhaps it pushes the system call name and the parameters supplied by the process onto a stack and then emits an assembly instruction, say "TRAP" or something similar: basically, the instruction that raises a software interrupt.

The hardware executes this TRAP instruction by first toggling the mode bit from user to kernel and then setting the code pointer to, say, the beginning of the interrupt service routines. From this point on, the ISR executes in kernel mode. It picks up the parameters from the stack (this is possible because the kernel has access to any memory location, even those owned by user processes), executes the system call, and finally relinquishes the CPU, which toggles the mode bit again so that the user process resumes where it left off.

Is my understanding correct?

Attached is a rough diagram of my understanding (image not reproduced here).

Answer

sarnold · Jun 5, 2011

Your understanding is pretty close; the trick is that most compilers will never write system calls, because the functions that programs call (e.g. getpid(2), chdir(2), etc.) are actually provided by the standard C library. The standard C library contains the code for the system call, whether it is called via INT 0x80 or SYSENTER. It'd be a strange program that makes system calls without a library doing the work. (Even though perl provides a syscall() function that can directly make system calls! Crazy, right?)
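
To make the "library does the work" point concrete, here is a minimal sketch, assuming Linux with glibc. The helper names pid_via_wrapper and pid_via_raw_syscall are made up for illustration. Note that syscall(2) is itself a C library function, so even the "raw" path goes through library code that issues the actual trap instruction.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: ask for the process ID through the usual
 * C library wrapper, the way almost every program does. */
long pid_via_wrapper(void)
{
    return (long) getpid();
}

/* Hypothetical helper: ask for the same thing through syscall(2),
 * naming the system call by its number instead of by a wrapper. */
long pid_via_raw_syscall(void)
{
    return syscall(SYS_getpid);
}
```

Both helpers should report the same process ID, because both end up in the same kernel entry point; only the library code in front of it differs.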

Next, the memory. The operating system kernel sometimes has easy address-space access to the user process's memory. Of course, the protection modes are different, and user-supplied data must be copied into the kernel's protected address space so that the user process cannot modify it while the system call is in flight:

static int do_getname(const char __user *filename, char *page)
{
    int retval;
    unsigned long len = PATH_MAX;

    if (!segment_eq(get_fs(), KERNEL_DS)) {
        if ((unsigned long) filename >= TASK_SIZE)
            return -EFAULT;
        if (TASK_SIZE - (unsigned long) filename < PATH_MAX)
            len = TASK_SIZE - (unsigned long) filename;
    }

    retval = strncpy_from_user(page, filename, len);
    if (retval > 0) {
        if (retval < len)
            return 0;
        return -ENAMETOOLONG;
    } else if (!retval)
        retval = -ENOENT;
    return retval;
}

This isn't a system call itself; it is a helper function, called by system call functions, that copies filenames into the kernel's address space. It checks that the entire filename resides within the user's data range, calls a function that copies the string in from user space, and performs some sanity checks before returning.
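
The three error returns in that helper can be observed from user space. The following is a sketch rather than a specification: it assumes Linux, assumes PATH_MAX is 4096, and uses access(2) only because it takes a filename and has no side effects. The names path_errno and the probe_* functions are hypothetical.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: hand `path` to access(2) and report the
 * errno the kernel set, or 0 if the call succeeded. */
int path_errno(const char *path)
{
    errno = 0;
    if (access(path, F_OK) == 0)
        return 0;
    return errno;
}

/* An empty filename: the copy from user space succeeds but yields
 * zero bytes, which corresponds to the -ENOENT branch. */
int probe_empty(void)
{
    return path_errno("");
}

/* A filename longer than PATH_MAX (assumed to be 4096 here), which
 * corresponds to the -ENAMETOOLONG branch. */
int probe_too_long(void)
{
    static char name[8192];
    memset(name, 'a', sizeof name - 1);
    name[sizeof name - 1] = '\0';
    return path_errno(name);
}

/* A pointer at the very top of the address space, which cannot be
 * a readable user address; this corresponds to the -EFAULT case. */
int probe_bad_pointer(void)
{
    return path_errno((const char *) UINTPTR_MAX);
}
```

On a typical Linux system I would expect these to come back as ENOENT, ENAMETOOLONG and EFAULT respectively. The process is not killed for passing a bad pointer; the kernel notices the problem while copying and turns it into an error code.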

get_fs() and similar functions are remnants of Linux's x86 roots. The functions have working implementations on all architectures, but the names remain archaic.

All the extra work with segments is because the kernel and userspace might share some portion of the available address space. On a 32-bit platform (where the numbers are easy to comprehend), the kernel will typically have one gigabyte of virtual address space, and user processes will typically have three gigabytes of virtual address space.
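
The range check at the top of do_getname() is easier to follow with concrete numbers. This is a user-space sketch of that arithmetic only, assuming the 3 GB / 1 GB split described above (so user space ends at 0xC0000000) and a PATH_MAX of 4096. SKETCH_TASK_SIZE, SKETCH_PATH_MAX and clamp_name_len are illustrative names, not kernel symbols.

```c
#include <errno.h>
#include <stdint.h>

#define SKETCH_TASK_SIZE 0xC0000000ULL /* assumed: 3 GB user, 1 GB kernel */
#define SKETCH_PATH_MAX  4096ULL       /* assumed value of PATH_MAX */

/* Returns how many bytes may be copied starting at user address
 * `addr`, or -EFAULT if `addr` already lies in the kernel's part of
 * the address space. The limit is clamped so that the copy can
 * never run past the end of the user range. */
long long clamp_name_len(uint64_t addr)
{
    if (addr >= SKETCH_TASK_SIZE)
        return -EFAULT;
    if (SKETCH_TASK_SIZE - addr < SKETCH_PATH_MAX)
        return (long long) (SKETCH_TASK_SIZE - addr);
    return (long long) SKETCH_PATH_MAX;
}
```

For example, a filename pointer 100 bytes below the boundary gets a limit of 100 rather than 4096, and a pointer at or above the boundary is rejected outright.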

When a process calls into the kernel, the kernel will 'fix up' the page table permissions to give itself access to the whole range, and it gets the benefit of pre-filled TLB entries for user-provided memory. Great success. But when the kernel must context switch back to userspace, it has to flush the TLB to remove the cached privileges on kernel address space pages.

But the trick is, one gigabyte of virtual address space is not sufficient for all kernel data structures on huge machines. Maintaining the metadata of cached filesystems and block device drivers, networking stacks, and the memory mappings for all the processes on the system, can take a huge amount of data.

So different 'splits' are available: two gigs for user and two gigs for kernel, one gig for user and three gigs for kernel, and so on. As the space for the kernel goes up, the space for user processes goes down. There is also a 4:4 memory split that gives four gigabytes to the user process and four gigabytes to the kernel; with it, the kernel must fiddle with segment descriptors to be able to access user memory. The TLB is flushed on entering and exiting system calls, which is a pretty significant speed penalty. But it lets the kernel maintain significantly larger data structures.

The much larger page tables and address ranges of 64-bit platforms probably make all the preceding look quaint. I sure hope so, anyway.