How to durably rename a file in POSIX?

Yang · Sep 21, 2010

What's the correct way to durably rename a file in a POSIX file system? Specifically wondering about fsyncs on the directories. (If this depends on the OS/FS, I'm asking about Linux and ext3/ext4).

Note: there are other questions on StackOverflow about durable renames, but AFAICT they don't address fsync-ing the directories (which is what matters to me - I'm not even modifying file data).

I currently have (in Python):

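import os

# srcdirpath, dstdirpath and filename are defined elsewhere
dstdirfd = os.open(dstdirpath, os.O_DIRECTORY | os.O_RDONLY)
os.rename(os.path.join(srcdirpath, filename), os.path.join(dstdirpath, filename))
os.fsync(dstdirfd)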

Specific questions:

  • Does this also implicitly fsync the source directory? Or might I end up with the file showing up in both directories after a power cycle (meaning I'd have to check the hard link count and manually perform recovery), i.e. it's impossible to guarantee a durably atomic move operation?
  • If I fsync the source directory instead of the destination directory, will that also implicitly fsync the destination directory?
  • Are there any useful related testing/debugging/learning tools (fault injectors, introspection tools, mock filesystems, etc.)?

Thanks in advance.

Answer

Robert Siemer · May 11, 2013

Unfortunately Dave’s answer is wrong.

Not every POSIX system even has durable storage. And where one does, that storage is still “allowed” to be hosed after a system crash. For such systems a no-op fsync() makes sense, and POSIX explicitly allows it. It is also legal for the file to be recoverable in the old directory, the new directory, both, or any other location. POSIX makes no guarantees about system crashes or file system recovery.

The real question should be:

How to do a durable rename on systems which support that through the POSIX API?

You need to fsync() both the source and the destination directory, because the minimum each fsync() is supposed to do is persist what its own directory should look like.
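A minimal sketch of that advice in Python, assuming Linux (for os.O_DIRECTORY) and a file system where fsync() on a directory actually persists its entries. The function name durable_rename and the order of the two fsync() calls are choices made for this illustration, not something POSIX prescribes.

```python
import os

def durable_rename(src_dir, dst_dir, filename):
    """Move filename from src_dir to dst_dir, then fsync both directories."""
    src_fd = os.open(src_dir, os.O_RDONLY | os.O_DIRECTORY)
    try:
        dst_fd = os.open(dst_dir, os.O_RDONLY | os.O_DIRECTORY)
        try:
            os.rename(os.path.join(src_dir, filename),
                      os.path.join(dst_dir, filename))
            # Assumption of this sketch: persist the new entry before the
            # removal of the old one, so a crash between the two calls is
            # more likely to leave the file in both places than in neither.
            os.fsync(dst_fd)
            os.fsync(src_fd)
        finally:
            os.close(dst_fd)
    finally:
        os.close(src_fd)
```

Calling durable_rename('/data/incoming', '/data/done', 'job.txt') (hypothetical paths) moves the file and returns only after both fsync() calls have returned. Whether that makes the move durable still depends on the file system and hardware, as discussed here.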

Does a fsync(destdirfd) also implicitly fsync the source directory?

  • POSIX in general: no, nothing implies that
  • ext3/4: I’m not sure whether the changes to the source and destination directories end up in the same journal transaction. If they do, they are committed together.

Or might I end up with the file showing up in both directories after a power cycle (“crash”), i.e. it's impossible to guarantee a durably atomic move operation?

  • POSIX in general: no guarantees, but you’re supposed to fsync() both directories, and the pair of calls might not be atomic-durable
  • ext3/4: the minimum number of fsync() calls you need depends on the mount options. E.g. if mounted with “dirsync” you don’t need either of the two fsync()s. At most you need both fsync()s, but I’m almost sure one is enough (and the move is then atomic-durable).
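If the file can show up in both directories after a crash, the manual recovery mentioned in the question could look roughly like the sketch below. It assumes both names still point at the same inode and that the move should be completed rather than rolled back. The name finish_interrupted_move is hypothetical. The sketch compares the two names with os.path.samefile instead of trusting the link count, since nothing above guarantees what the count is after recovery.

```python
import os

def finish_interrupted_move(src_dir, dst_dir, filename):
    """Remove a leftover source entry; return True if one was removed."""
    src = os.path.join(src_dir, filename)
    dst = os.path.join(dst_dir, filename)
    if not (os.path.exists(src) and os.path.exists(dst)):
        return False  # nothing to recover: the file is in at most one place
    # Only treat this as a half-finished move if both names are the same file.
    if not os.path.samefile(src, dst):
        return False
    os.unlink(src)
    # Persist the removal of the source entry.
    fd = os.open(src_dir, os.O_RDONLY | os.O_DIRECTORY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)
    return True
```

Running this once at startup for every in-flight file name would complete any move that was interrupted, under the assumptions stated above.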

If I fsync the source directory instead of the destination directory, will that also implicitly fsync the destination directory?

  • POSIX: no
  • ext3/4: I really believe both end up in the same transaction, so it doesn’t matter which of them you fsync()
  • older kernels, ext3: (if they aren’t in the same transaction) the not-so-optimal implementation did far too much syncing on fsync(); I bet it committed every transaction that came before. And yes, a normal implementation would first link the file into the destination and then remove it from the source, so fsync(srcdirfd) would trigger the sync of the destination as well.
  • ext4/latest ext3: if they aren’t in the same transaction, you might be able to completely sync them independently (so do both)

Are there any useful related testing/debugging/learning tools (fault injectors, introspection tools, mock filesystems, etc.)?

For a real crash, no. By the way, a real crash goes beyond what the kernel can see. The hardware might reorder writes (and fail to write everything), corrupting the filesystem. Ext4 is better prepared for this, because it enables write barriers (a mount option) by default (ext3 does not) and can detect corruption with journal checksums (also a mount option).
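Since so much depends on mount options, it can help to look at what a mount point was actually mounted with. The sketch below parses text in the /proc/mounts format (device, mount point, file system type, comma-separated options, two numeric fields). The helper name mount_options is hypothetical. Options left at their defaults may not be listed, so a missing option does not prove the feature is off.

```python
def mount_options(mounts_text, mount_point):
    """Return the option list for mount_point from /proc/mounts-style text.

    Returns None if the mount point does not appear. Mount points that
    contain spaces are escaped in /proc/mounts and are not handled here.
    """
    options = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            # A later entry for the same mount point shadows an earlier one.
            options = fields[3].split(",")
    return options
```

On Linux it could be used as follows: read /proc/mounts with open("/proc/mounts").read(), pass the text and a mount point such as "/" to mount_options, and check whether "dirsync" is in the returned list.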

And for learning: find out if both changes are somehow linked in the journal! :-P