Rsync checksum only for same size files

Sylvain · Jun 25, 2013 · Viewed 12.8k times

There are a bunch of threads about rsync checksums, but none seems to address this need, which would be the most effective and fastest way to sync, at least in my case:

  • same time and same size ► skip file (no transfer, no checksum)
  • different sizes ► transfer file (no checksum)
  • different times and same size ► perform checksum ► transfer only if checksums differ

I noticed that the --checksum option can take a really long time to mirror a folder that contains a lot of files. Used alone, it runs a checksum on every single file, which is very safe but very slow, and reading every file to compute its checksum adds I/O overhead.
The --ignore-times option is not what I want: if time and size both match, the chance that the files differ is insignificant, and I'm willing to take the risk of not transferring.
The --size-only option is incomplete, as there is a good chance that files with the same size but different times are actually different (e.g. replacing one character with another changes the modification time but not the size).

Is there a way to perform the mirroring as per the combination above, with rsync (did I miss something in the manpages) or with any other Linux tools?
Thanks.

Answer

MRV · May 18, 2014

When determining whether to transfer files (or with --dry-run, whether to list files), rsync will always transfer files that differ in filesize. However, when files are the same size, rsync has several options:

  • with --size-only: never transfer files
  • with --ignore-times: always transfer files
  • default: if timestamps differ, transfer files
  • with --checksum: calculate checksums and transfer files if they differ

The behavior that you want would be a combination of the last two: "if timestamps differ, calculate checksums and transfer files if the checksums differ as well". This is not currently an option in rsync.
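To make the comparison concrete, here is a minimal Python sketch of the decision rules listed above, plus the combined rule the question asks for. It is only a model of the logic, not rsync's actual code: the function names, the mode strings (in particular "checksum-if-times-differ", which is hypothetical and not an rsync option), the use of SHA-256 and the whole-second timestamp comparison are all assumptions made for illustration.

```python
import hashlib
import os


def file_digest(path, chunk_size=1 << 20):
    """Return a SHA-256 hex digest of the file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_transfer(src, dst, mode="default"):
    """Model of the 'should this file be transferred?' decision.

    mode is one of "default", "size-only", "ignore-times", "checksum",
    or the hypothetical "checksum-if-times-differ".
    """
    if not os.path.exists(dst):
        return True  # nothing on the receiving side yet

    src_stat, dst_stat = os.stat(src), os.stat(dst)
    if src_stat.st_size != dst_stat.st_size:
        return True  # different sizes: always transfer, no checksum

    # Simplification: compare modification times at whole-second precision.
    same_time = int(src_stat.st_mtime) == int(dst_stat.st_mtime)

    if mode == "size-only":
        return False
    if mode == "ignore-times":
        return True
    if mode == "default":
        return not same_time
    if mode == "checksum":
        return file_digest(src) != file_digest(dst)
    if mode == "checksum-if-times-differ":  # hypothetical, not in rsync
        if same_time:
            return False  # same size and time: skip without reading the file
        return file_digest(src) != file_digest(dst)
    raise ValueError("unknown mode: " + mode)
```

The last branch is the only one that reads file contents conditionally: files are hashed only when the sizes match and the timestamps differ, which is exactly the case the existing options cannot single out.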

Unfortunately, looking at the rsync source code, it appears that adding this functionality would be non-trivial. Currently, if checksums are used, the remote rsync gathers the size, timestamp and checksum information and sends it all together. The desired behavior would require the remote rsync to first send only the size and timestamp and then, once the local rsync determines that a checksum is needed, to return to the file to compute the checksum. But this "remote rsync returns to the file" step is not present in the current code and would first need to be written.

When you run an actual transfer, the second step can effectively be done during the transfer process itself: transferring a file that does not differ is very efficient, because over a network rsync's delta-transfer algorithm sends only the blocks that changed. So the default behavior of rsync suffices. With --dry-run, the best approach is probably to run rsync with the default behavior first, collect the --dry-run output, and then run rsync again, with --checksum, on only the files found in the first run.
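The two-pass --dry-run idea could be scripted roughly as follows. This is a sketch under several assumptions, not a tested recipe: the helper names are made up for illustration, the source path is assumed to end with a trailing slash (so the names printed by --out-format=%n are relative to it and can be fed back through --files-from), file names are assumed not to contain newlines, and -a is just one plausible choice of base options.

```python
import subprocess
import tempfile


def first_pass_command(src, dst):
    """Default quick check (size + mtime), listing candidate names only."""
    return ["rsync", "-a", "--dry-run", "--out-format=%n", src, dst]


def second_pass_command(src, dst, list_path):
    """Checksum comparison restricted to the names listed in list_path."""
    return ["rsync", "-a", "--dry-run", "--checksum", "--out-format=%n",
            "--files-from=" + list_path, src, dst]


def candidates_from_output(output):
    """Keep file names; drop blank lines and directory entries ("dir/")."""
    return [line for line in output.splitlines()
            if line and not line.endswith("/")]


def dry_run_with_selective_checksum(src, dst):
    """Return the names that still differ after checksumming the candidates."""
    first = subprocess.run(first_pass_command(src, dst), check=True,
                           capture_output=True, text=True)
    names = candidates_from_output(first.stdout)
    if not names:
        return []
    # Write the candidate list to a temporary file for --files-from.
    with tempfile.NamedTemporaryFile("w", suffix=".lst") as listing:
        listing.write("\n".join(names) + "\n")
        listing.flush()
        second = subprocess.run(
            second_pass_command(src, dst, listing.name),
            check=True, capture_output=True, text=True)
    return candidates_from_output(second.stdout)
```

Calling dry_run_with_selective_checksum("src/", "host:dst/") with your own paths would then checksum only the files the first pass flagged, instead of every file in the tree. Files that are new or differ in size are still checksummed in the second pass, so this is a little less economical than the ideal behavior described in the question.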