How is __thread in gcc implemented? Is it simply a wrapper over pthread_getspecific and pthread_setspecific?
My program uses the POSIX API for TLS, and I'm rather disappointed to see that 30% of its runtime is spent in pthread_getspecific. I call it at the entry of each function that needs the resource. The compiler doesn't seem to optimize out the repeated pthread_getspecific calls after inlining, so once the functions are inlined the code keeps looking up the TLS pointer again and again, only to get the same pointer back each time.
Will __thread help me in this situation? I know that there is thread_local in C11, but the gcc I have doesn't support it yet. (But now I see that my gcc does support _Thread_local, just not the macro.)
I know I can simply test it and see. But I have to go somewhere else now, and I'd like to know better on a feature before I attempt a quite big rewrite.
Recent GCC versions, e.g. GCC 5, do support C11 and its thread_local (if compiling with e.g. gcc -std=c11). As FUZxxl commented, you could use (instead of C11 thread_local) the __thread qualifier supported by older GCC versions. Read about Thread Local Storage.
pthread_getspecific is indeed quite slow since it involves a function call (it is part of the POSIX library, so it is not provided by GCC but by e.g. GNU glibc or musl-libc). Using thread_local variables will very probably be faster.
Look into the source code of MUSL's thread/pthread_getspecific.c file for an example of implementation. Read this answer to a related question.
And __thread & thread_local are (often) not magically translated to calls to pthread_getspecific. They usually involve some specific addressing mode and/or register, with help from the compiler, the linker and the runtime system. The details are implementation specific and related to the ABI; on Linux, I guess that since x86-64 has more registers & addressing modes, its implementation of TLS is faster than on i386. Conversely, it could happen that some implementations of pthread_getspecific use internal thread_local variables (in your implementation of POSIX threads).
As an example, compiling the following code
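    #include <pthread.h>

    extern const pthread_key_t key;
    __thread int data;

    /* reads the thread-local variable directly */
    int
    get_data (void) {
      return data;
    }

    /* reads the per-thread value through the POSIX key API */
    int
    get_by_key (void) {
      return *(int*) (pthread_getspecific (key));
    }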
using GCC 5.2 (on Debian/Sid) with gcc -m32 -S -O2 -fverbose-asm gives the following code for get_data using TLS:
        .type   get_data, @function
    get_data:
    .LFB3:
        .cfi_startproc
        movl    %gs:data@ntpoff, %eax   # data,
        ret
        .cfi_endproc
and the following code for get_by_key, with an explicit call to pthread_getspecific:
    get_by_key:
    .LFB4:
        .cfi_startproc
        subl    $24, %esp       #,
        .cfi_def_cfa_offset 28
        pushl   key     # key
        .cfi_def_cfa_offset 32
        call    pthread_getspecific     #
        movl    (%eax), %eax    # MEM[(int *)_4], MEM[(int *)_4]
        addl    $28, %esp       #,
        .cfi_def_cfa_offset 4
        ret
        .cfi_endproc
Hence using TLS with __thread (or thread_local in C11) should probably be faster than using pthread_getspecific (avoiding the overhead of a call).
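If a full rewrite is too much at once, one possible middle ground is to keep the POSIX key but cache the pointer in a __thread variable, so the key lookup happens only once per thread. The following is only a sketch under assumptions: struct resource, get_resource, resource_slow and bump are hypothetical names, not taken from the question, and the key is kept only so that its destructor frees the resource when a thread exits (plain C __thread variables have no destructors).

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical per-thread resource; stands in for whatever the program stores. */
struct resource {
    int counter;
};

static pthread_key_t resource_key;
static pthread_once_t resource_key_once = PTHREAD_ONCE_INIT;
static int resource_key_ok;

/* Fast path: reading this pointer needs no function call. */
static __thread struct resource *cached_resource;

static void make_key(void) {
    /* free() is registered as the destructor run at thread exit. */
    resource_key_ok = (pthread_key_create(&resource_key, free) == 0);
}

/* Slow path: taken on the first use in each thread (or after a failure). */
static struct resource *resource_slow(void) {
    pthread_once(&resource_key_once, make_key);
    if (!resource_key_ok)
        return NULL;
    struct resource *r = calloc(1, sizeof *r);
    if (r != NULL && pthread_setspecific(resource_key, r) != 0) {
        free(r);
        r = NULL;
    }
    cached_resource = r;
    return r;
}

static inline struct resource *get_resource(void) {
    struct resource *r = cached_resource;
    return r != NULL ? r : resource_slow();
}

/* Example user: increments the calling thread's counter, -1 on failure. */
int bump(void) {
    struct resource *r = get_resource();
    return r != NULL ? ++r->counter : -1;
}
```

Since get_resource is a plain TLS read on the fast path, the compiler is free to inline it and to reuse the loaded pointer within a function, which it generally cannot do across opaque calls to pthread_getspecific.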
Notice that thread_local is a convenience macro defined in <threads.h> (a C11 standard header).
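If your compiler accepts _Thread_local but your C library does not ship the macro, you can probably pick the spelling yourself without the header. A minimal sketch, assuming GCC or a compatible compiler; TLS_VAR, call_count and count_call are hypothetical names chosen for illustration:

```c
/* Choose a thread-local storage qualifier without relying on <threads.h>. */
#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L
#  define TLS_VAR _Thread_local   /* C11 keyword, no header needed */
#elif defined(__GNUC__)
#  define TLS_VAR __thread        /* GNU extension for older modes */
#else
#  error "no known thread-local storage qualifier for this compiler"
#endif

/* Each thread gets its own zero-initialized copy of this counter. */
static TLS_VAR int call_count;

/* Returns how many times the calling thread has called this function. */
int count_call(void) {
    return ++call_count;
}
```

Either spelling should give the same kind of direct access as the __thread example above, so the choice is only about which language mode you compile in.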