views:

567

answers:

4

I have two machines, speed and mass. speed has a fast Internet connection and is running a crawler which downloads a lot of files to disk. mass has a lot of disk space. I want to move the files from speed to mass after they're done downloading. Ideally, I'd just run:

$ rsync --remove-source-files speed:/var/crawldir .

but I worry that rsync will unlink a source file that hasn't finished downloading yet. (I looked at the source code and I didn't see anything protecting against this.) Any suggestions?

+2  A: 

How much control do you have over the download process? If you roll your own, you can have the file being downloaded go to a temp directory or have a temporary name until it's finished downloading, and then mv it to the correct name when it's done. If you're using third party software, then you don't have as much control, but you still might be able to do the temp directory thing.

Paul Tomblin
+2  A: 

It seems to me the problem is transferring a file before it's complete, not that you're deleting it.

If this is Linux, it's possible for a file to be open by process A and process B can unlink the file. There's no error, but of course A is wasting its time. Therefore, the fact that rsync deletes the source file is not a problem.

The problem is rsync deletes the source file only after it's copied, and if it's still being written to disk you'll have a partial file.

How about this: Mount mass as a remote file system (NFS would work) in speed. Then just web-crawl the files directly.

Jason Cohen
A: 

Rsync can exclude files matching certain patters. Even if you can't modify it to make it download files to a temporary directory, maybe it has a convention of naming the files differently during download (for example: foo.downloading while downloading for a file named foo) and you can use this property to exclude files which are still being downloaded from being copied.

Cd-MaN
A: 

If you have control over the crawling process, or it has predictable output, the above solutions (storing in a tempfile until finished, then mv'ing to the completed-downloads place, or ignoring files with a '.downloading' kind of name) might work. If all of that is beyond your control, you can make sure that the file is not opened by any process by doing 'lsof $filename' and checking if there's a result. Clearly if no one has the file open, it's safe to move it over.

pjz