Cost of context switch between threads of same process, on Linux

R.. GitHub STOP HELPING ICE picture R.. GitHub STOP HELPING ICE · May 11, 2011 · Viewed 7.3k times · Source

Is there any good empirical data on the cost of context switching between threads of the same process on Linux (x86 and x86_64, mainly, are of interest)? I'm talking about the number of cycles or nanoseconds between the last instruction one thread executes in userspace before getting put to sleep voluntarily or involuntarily, and the first instruction a different thread of the same process executes after waking up on the same cpu/core.

I wrote a quick test program that constantly performs rdtsc in 2 threads assigned to the same cpu/core, stores the result in a volatile variable, and compares to its sister thread's corresponding volatile variable. The first time it detects a change in the sister thread's value, it prints the difference, then goes back to looping. I'm getting minimum/median counts of about 8900/9600 cycles this way on an Atom D510 cpu. Does this procedure seem reasonable, and do the numbers seem believable?

My goal is to estimate whether, on modern systems, thread-per-connection server model could be competitive with or even outperform select-type multiplexing. This seems plausible in theory, as the transition from performing IO on fd X to fd Y involves merely going to sleep in one thread and waking up in another, rather than multiple syscalls, but it's dependent on the overhead of context switching.

Answer

caf picture caf · May 11, 2011

(Disclaimer: This isn't a direct answer to the question, it's just some suggestions that I hope will be helpful).

Firstly, the numbers you're getting certainly sound like they're within the ballpark. Note, however, that the interrupt / trap latency can vary a lot among different CPU models implementing the same ISA. It's also a different story if your threads have used floating point or vector operations, because if they haven't the kernel avoids saving/restoring the floating point or vector unit state.

You should be able to get more accurate numbers by using the kernel tracing infrastructure - perf sched in particular is designed to measure and analyse scheduler latency.

If your goal is to model thread-per-connection servers, then you probably shouldn't be measuring involuntary context switch latency - usually in such a server, the majority of context switches will be voluntary, as a thread blocks in read() waiting for more data from the network. Therefore, a better testbed might involve measuring the latency from one thread blocking in a read() to another being woken up from the same.

Note that in a well-written multiplexing server under heavy load, the transition from fd X to fd Y will often involve the same single system call (as the server iterates over a list of active file descriptors returned from a single epoll()). One thread also ought to have less cache footprint than multiple threads, simply through having only one stack. I suspect the only way to settle the matter (for some definition of "settle") might be to have a benchmark shootout...