keep rsync from removing unfinished source files

aaronsw · Sep 7, 2008

I have two machines, speed and mass. speed has a fast Internet connection and is running a crawler which downloads a lot of files to disk. mass has a lot of disk space. I want to move the files from speed to mass after they're done downloading. Ideally, I'd just run:

$ rsync -a --remove-source-files speed:/var/crawldir .

but I worry that rsync will unlink a source file that hasn't finished downloading yet. (I looked at the source code and I didn't see anything protecting against this.) Any suggestions?

Answer

Jason Cohen · Sep 7, 2008

It seems to me the problem is transferring a file before it's complete, not that you're deleting it.

If this is Linux, process B can unlink a file while process A still has it open. A gets no error: it keeps writing to the now-nameless file, and that data is thrown away when A closes it, so A is wasting its time. The deletion by itself therefore doesn't break anything.
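To make the unlink behaviour concrete, here is a minimal sketch, assuming a POSIX system such as Linux. The function name `unlink_while_open` and the byte strings are made up for illustration; one process plays both roles, holding the file open like A and unlinking it like B.

```python
import os
import tempfile


def unlink_while_open():
    """Write to a file, unlink it, keep writing; report what happened.

    Returns (path_still_exists, bytes_written_through_open_handle).
    Assumes POSIX unlink semantics (e.g. Linux).
    """
    fd, path = tempfile.mkstemp()
    writer = os.fdopen(fd, "wb")       # "process A": holds the file open
    writer.write(b"first half ")
    writer.flush()

    os.unlink(path)                    # "process B": removes the name

    writer.write(b"second half")       # still succeeds, no error raised
    writer.flush()

    exists = os.path.exists(path)      # the name is gone
    size = os.fstat(writer.fileno()).st_size  # the data lives on until close
    writer.close()                     # now the data is discarded
    return exists, size


if __name__ == "__main__":
    print(unlink_while_open())
```

The write after `os.unlink` goes through without complaint, which is why the writer never notices that its output has nowhere to land.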

The problem is that rsync deletes the source file as soon as it has copied it, so if the crawler is still writing that file when rsync reads it, you end up with a partial file on mass and the original is gone.
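The failure mode can be sketched locally without rsync at all. In this illustration, assuming a POSIX system, `shutil.copyfile` stands in for rsync's transfer and `os.unlink` for `--remove-source-files`; the file names and the function `copy_mid_write` are hypothetical.

```python
import os
import shutil
import tempfile


def copy_mid_write():
    """Copy and delete a file while it is still being written.

    Returns the bytes that ended up in the copy.
    Assumes POSIX unlink semantics (e.g. Linux).
    """
    workdir = tempfile.mkdtemp()
    src = os.path.join(workdir, "page.html")   # hypothetical crawler output
    dst = os.path.join(workdir, "page.copy")   # hypothetical destination

    with open(src, "wb") as writer:
        writer.write(b"<html>")
        writer.flush()                 # only the first part is on disk

        shutil.copyfile(src, dst)      # stand-in for rsync's transfer
        os.unlink(src)                 # stand-in for --remove-source-files

        writer.write(b"</html>")       # no error, but never reaches the copy

    with open(dst, "rb") as f:
        copied = f.read()
    shutil.rmtree(workdir)
    return copied


if __name__ == "__main__":
    print(copy_mid_write())
```

The copy holds only what had been written at the moment of the transfer, and the rest of the download is lost with the unlinked source.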

How about this: mount mass as a remote file system on speed (NFS would work), then have the crawler write its files directly to that mount.