I have an app that I'm writing that takes files in a specific directory that have been uploaded via SFTP and moves them to S3.

I have a problem where my cron job starts uploading a file when it's not completely uploaded. I have thought of every way to try and wait until the file is complete, but I have no way of knowing (that I know of).

I'm hoping that the collective genius of SO would be able to shed some light on this!

A: 

Is there any way you can add a step after the SFTP transfer? The idea is to SFTP the files to a temporary directory, then, once that's done, have the same client execute (via SSH) a script to mv the files over to the directory the cron job is looking at. mv is atomic when the source and destination are on the same filesystem (it is then a single rename), so put the temporary directory on the same filesystem as the watched one. The cron job will then only ever see either the old file or the new one, never a partial one.
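If the post-transfer step is written in Python rather than as a shell script, the move could look roughly like this. It's only a sketch: the function name `publish` and the two directories are made up for illustration, and it assumes both directories live on the same filesystem.

```python
import os
from pathlib import Path


def publish(staged: Path, watched_dir: Path) -> Path:
    """Move a fully uploaded file from staging into the watched directory.

    Assumes staged and watched_dir are on the same filesystem, so the
    move is a single rename and the file appears all at once.
    """
    target = watched_dir / staged.name
    # os.replace renames the file, overwriting any existing target.
    # It does not copy: across filesystems it raises OSError instead.
    os.replace(staged, target)
    return target
```

Because `os.replace` refuses to work across filesystems rather than quietly copying, a misconfigured staging directory shows up as an error instead of reintroducing partial files.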

Of course, if you can execute a script after the SFTP transfer you can just have the script do the transfer to S3, without the cron job ;)

ZoogieZork
+2  A: 

There are a number of ways to handle this:

  1. Change the upload process to upload the data file itself (e.g., data.txt) followed by a sentinel file (e.g., data.txt.sentinel). Then wait for the sentinel before processing the data file and deleting them both. Data files older than N days with no corresponding sentinel can simply be deleted. This is only good if you can change the uploader.

  2. If you can evaluate the content of the file to check completeness, this is another way. For example, if you're only uploading HTML files, you could check that it ends with </html>. Not always possible unless you can control what's being uploaded.

  3. The not-been-modified-for-a-while method. Basically, if the file hasn't been modified for N minutes, you can assume the upload has been finished. This may still result in the processing of incomplete files where the transfer has failed partway through.
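For option 1, the check on the cron side might look something like the following. This is a rough sketch, not a finished tool: the `.sentinel` suffix follows the data.txt.sentinel example above, and the names `ready_files` and `finish` are invented here.

```python
from pathlib import Path

# Assumed naming convention: data.txt is complete once data.txt.sentinel exists.
SENTINEL_SUFFIX = ".sentinel"


def ready_files(directory: Path) -> list:
    """Return data files whose sentinel file has arrived."""
    ready = []
    for sentinel in sorted(directory.glob("*" + SENTINEL_SUFFIX)):
        if sentinel.name == SENTINEL_SUFFIX:
            continue  # a bare ".sentinel" has no data file name in front of it
        data = sentinel.with_name(sentinel.name[: -len(SENTINEL_SUFFIX)])
        if data.is_file():
            ready.append(data)
    return ready


def finish(data: Path) -> None:
    """Delete the data file and its sentinel after successful processing."""
    data.unlink()
    data.with_name(data.name + SENTINEL_SUFFIX).unlink()
```

The cron job would loop over `ready_files(...)`, push each file to S3, and call `finish` only once the upload has succeeded, so a failed run just gets retried next time.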

All these methods have their advantages and drawbacks and you will have to decide which is the best for you. We try to opt for number 1 where we can influence the uploading side.

And remember that N is configurable in the above scenarios. In option 3 you need to strike a balance: too small an N risks processing a file that is still being uploaded, while too large an N delays the processing of said file.
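For option 3, a minimal sketch of the "quiet period" check is below. The function name `quiet_files` and its parameters are hypothetical; `quiet_seconds` plays the role of N, and `now` is only there so the check can be tested with a fixed clock.

```python
import time
from pathlib import Path


def quiet_files(directory: Path, quiet_seconds: float, now: float = None) -> list:
    """Return regular files not modified within the last quiet_seconds."""
    if now is None:
        now = time.time()
    result = []
    for path in sorted(directory.iterdir()):
        # st_mtime is the last modification time, in seconds since the epoch.
        if path.is_file() and now - path.stat().st_mtime >= quiet_seconds:
            result.append(path)
    return result
```

Something like `quiet_files(Path("/some/upload/dir"), 10 * 60)` would pick up files untouched for ten minutes (the path and the ten minutes are placeholders). As noted for option 3, a transfer that died partway through will also go quiet, so this cannot tell a finished upload from an abandoned one.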

paxdiablo
The problem is that the files are all random files (random file types, that is), and there is no specific uploader. I like option 3, so I may try that.
WedTM
A: 

We are using pure-ftpd for a very similar process. Rather than having a cron job do the uploads, we use the upload script option of pure-ftpd, which triggers a script every time an upload is complete. You might consider using a similar mechanism if it is available with your FTP server.

Paul McMahon
We no longer support FTP as it is not as secure as SFTP, and our content requires security. Good choice for non-secure uploads, though!
WedTM