I have two machines, speed and mass. speed has a fast Internet connection and is running a crawler which downloads a lot of files to disk. mass has a lot of disk space. I want to move the files from speed to mass after they're done downloading. Ideally, I'd just run:
$ rsync --remove-source-files speed:/var/crawldir .
but I worry that rsync will unlink a source file that hasn't finished downloading yet. (I looked at the source code and I didn't see anything protecting against this.) Any suggestions?
It seems to me the problem is transferring a file before it's complete, not that you're deleting it.
If this is Linux, it's possible for a file to be open by process A and process B can unlink the file. There's no error, but of course A is wasting its time. Therefore, the fact that rsync deletes the source file is not a problem.
The problem is rsync deletes the source file only after it's copied, and if it's still being written to disk you'll have a partial file.
How about this: Mount mass
as a remote file system (NFS would work) in speed
. Then just web-crawl the files directly.