I am reading the Linux kernel implementation of the doubly linked list, and I do not understand the use of the macro WRITE_ONCE(x, val). It is defined as follows in compiler.h:
#define WRITE_ONCE(x, val) x=(val)
It is used seven times in the file, for example:
static inline void __list_add(struct list_head *new,
                              struct list_head *prev,
                              struct list_head *next)
{
        next->prev = new;
        new->next = next;
        new->prev = prev;
        WRITE_ONCE(prev->next, new);
}
I have read that it is used for avoiding race conditions.
I have two questions:
1/ I thought macros were replaced by code at compile time. So how does this code differ from the following one? How can this macro avoid race conditions?
static inline void __list_add(struct list_head *new,
                              struct list_head *prev,
                              struct list_head *next)
{
        next->prev = new;
        new->next = next;
        new->prev = prev;
        prev->next = new;
}
2/ How do we know when we should use it? For instance, it is used in __list_add() but not in __list_splice():
static inline void __list_splice(const struct list_head *list,
                                 struct list_head *prev,
                                 struct list_head *next)
{
        struct list_head *first = list->next;
        struct list_head *last = list->prev;

        first->prev = prev;
        prev->next = first;

        last->next = next;
        next->prev = last;
}
Edit:
Here is a commit message concerning this file and WRITE_ONCE, but it does not help me understand it:
list: Use WRITE_ONCE() when initializing list_head structures
Code that does lockless emptiness testing of non-RCU lists is relying on INIT_LIST_HEAD() to write the list head's ->next pointer atomically, particularly when INIT_LIST_HEAD() is invoked from list_del_init(). This commit therefore adds WRITE_ONCE() to this function's pointer stores that could affect the head's ->next pointer.
The first definition you refer to is part of the kernel lock validator, aka "lockdep"; in that context WRITE_ONCE (and friends) don't need their usual special treatment, but the reason why is the subject of another question.
The relevant definition is here, and a very terse comment states their purpose:
Prevent the compiler from merging or refetching reads or writes.
...
Ensuring that the compiler does not fold, spindle, or otherwise mutilate accesses that either do not require ordering or that interact with an explicit memory barrier or atomic instruction that provides the required ordering.
But what do those words mean?
The problem is actually plural:
Read/write "tearing" : replacing a single memory access with many smaller ones. GCC may (and does!) in certain situations replace something like p = 0x01020304;
with two 16-bit store-immediate instructions -instead of presumably placing the constant in a register and then a memory access, and so forth. WRITE_ONCE
would allow us to say to GCC, "don't do that", like so: WRITE_ONCE(p, 0x01020304);
2. C compilers have stopped guaranteeing that a word-sized access is atomic. Any program that is not race-free can be miscompiled with spectacular results. Not only that, but a compiler may decide not to keep certain values in registers inside a loop, leading to multiple references that can mess up code like this (a sketch of the fix follows this list):

       for (;;) {
               owner = lock->owner;
               if (owner && !mutex_spin_on_owner(lock, owner))
                       break;
               /* ... */
       }
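Both problems come down to forcing the compiler to emit exactly one full-width access per source-level access. Here is a minimal sketch of the idea, with hypothetical macro names (the kernel's real READ_ONCE/WRITE_ONCE are more elaborate, not least because they must also cope with sizes other than 1, 2, 4, and 8 bytes):

/* Minimal sketches, NOT the kernel's real definitions. They rely on
 * GCC's typeof extension and route every access through a
 * volatile-qualified lvalue of the object's own width. */
#define MY_WRITE_ONCE(x, val)   (*(volatile typeof(x) *)&(x) = (val))
#define MY_READ_ONCE(x)         (*(const volatile typeof(x) *)&(x))

/* Problem (1): the store below must be a single 32-bit store; GCC may
 * not split it into two 16-bit store-immediates. */
MY_WRITE_ONCE(p, 0x01020304);

/* Problem (2): lock->owner is loaded exactly once per use; the compiler
 * may neither refetch it between the NULL check and the use, nor hoist
 * the load out of the loop. */
for (;;) {
        owner = MY_READ_ONCE(lock->owner);
        if (owner && !mutex_spin_on_owner(lock, owner))
                break;
        /* ... */
}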
We begin by noting that the Linux kernel demands to be built with GCC, so there is only one compiler the solution has to handle, and we can use its documentation as the only guide. For a generic solution we need to cover memory accesses of every size: the fixed-width types and everything else. We also note that memory accesses which are already inside critical sections don't need to be tagged this way (why not?).
For sizes of 1, 2, 4, and 8 bytes there are appropriate fixed-width types, and volatile specifically forbids GCC from applying the optimisation we referred to in (1), and also takes care of other cases (last bullet point under "COMPILER BARRIERS"). It also prevents GCC from miscompiling the loop in (2), because that would mean moving the volatile access across a sequence point, which the C standard disallows. Note that Linux uses what we call a "volatile access" (see below) rather than tagging the object itself as volatile. We could solve our problem by marking the specific object volatile, but that is (almost?) never a good choice; there are many reasons it could be harmful.
This is how a volatile (write) access is implemented in the kernel for an 8-bit wide type:
*(volatile __u8_alias_t *) p = *(__u8_alias_t *) res;
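To tie this back to question 1: with the real definition (rather than the lockdep stub quoted in the question), the store in __list_add amounts, roughly, to the following. This is a simplified picture, not the macro's literal expansion:

*(struct list_head * volatile *)&prev->next = new;   /* WRITE_ONCE(prev->next, new) */

The plain assignment prev->next = new; carries no such guarantee: the compiler is free to tear it, defer it, or merge it with neighbouring stores.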
Suppose we didn't know exactly what volatile does (and finding out isn't easy! check out #5): another way to accomplish the same thing would be to place memory barriers around the access. This is exactly what Linux does when the size is anything other than 1, 2, 4, or 8 bytes: it resorts to memcpy() and places memory barriers before and after the call. Memory barriers easily solve problem (2) as well, but they incur large performance penalties.
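For completeness, here is a rough sketch of that size dispatch together with the memcpy() fallback. The helper name is hypothetical, but the kernel's own helper behind WRITE_ONCE follows essentially this pattern:

/* Illustrative sketch only; the name my_write_once_size is made up. */
static __always_inline void my_write_once_size(volatile void *p,
                                               const void *res, int size)
{
        switch (size) {
        case 1: *(volatile __u8  *)p = *(const __u8  *)res; break;
        case 2: *(volatile __u16 *)p = *(const __u16 *)res; break;
        case 4: *(volatile __u32 *)p = *(const __u32 *)res; break;
        case 8: *(volatile __u64 *)p = *(const __u64 *)res; break;
        default:
                /* No single machine store to rely on: keep the compiler
                 * from moving other accesses into or out of the copy. */
                barrier();
                __builtin_memcpy((void *)p, res, size);
                barrier();
        }
}

Note that barrier() is a compiler barrier: it constrains the compiler rather than the CPU, which is all that problems (1) and (2) require.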
I hope this gives a useful overview without delving into interpretations of the C standard, but if you'd like, I could take the time to do that.