views: 348

answers: 4
The situation is as follows:

A series of remote workstations collect field data and upload it to a server via FTP. The data is sent as a CSV file, which is stored in a unique directory for each workstation on the FTP server.

Each workstation sends a new update every 10 minutes, overwriting the previous data. We would like to concatenate or store this data automatically somehow. The workstations' processing power is limited and cannot be extended, as they are embedded systems.

One suggestion offered was to run a cron job on the FTP server; however, the terms of service for the shared hosting only allow cron jobs at 30-minute intervals. Given the number of workstations uploading and the 10-minute interval between uploads, the cron job's 30-minute minimum interval looks like it might be a problem.

Is there any other approach that might be suggested? The available server-side scripting languages are Perl, PHP and Python.

Upgrading to a dedicated server might be necessary, but I'd still like to get input on how to solve this problem in the most elegant manner.

+4  A: 

You might consider a persistent daemon that keeps polling the target directories:

grab_lockfile() or exit();    # only one instance may run at a time
while (1) {
    if (new_files()) {        # anything new in the watched directories?
        process_new_files();  # concatenate/archive the new CSVs
    }
    sleep(60);                # poll once a minute
}

Then your cron job can just try to start the daemon every 30 minutes. If the daemon can't grab the lockfile, it just dies, so there's no worry about multiple daemons running.
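
A minimal Python sketch of the same idea (the lock and upload paths are hypothetical, and process() is just a placeholder for whatever concatenation or archiving you want to do):

#!/usr/bin/env python
# Lockfile-guarded polling daemon: only one instance runs; cron just retries the start.
import fcntl, os, sys, time

LOCKFILE = "/tmp/ftp-watcher.lock"               # hypothetical
UPLOAD_ROOT = "/home/user/ftp/workstations"      # hypothetical: one subdirectory per workstation

def process(path):
    # Placeholder: e.g. append the new CSV to a per-workstation archive.
    with open(path + ".archive", "a") as out, open(path) as new:
        out.write(new.read())

def main():
    lock = open(LOCKFILE, "w")
    try:
        fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except (IOError, OSError):
        sys.exit(0)                              # another instance already holds the lock
    mtimes = {}
    while True:
        for dirpath, _, filenames in os.walk(UPLOAD_ROOT):
            for name in filenames:
                if not name.endswith(".csv"):
                    continue
                path = os.path.join(dirpath, name)
                mtime = os.path.getmtime(path)
                if mtimes.get(path) != mtime:    # new or overwritten upload
                    mtimes[path] = mtime
                    process(path)
        time.sleep(60)

if __name__ == "__main__":
    main()

The cron entry can then be as simple as */30 * * * * python /path/to/watcher.py (path hypothetical); the extra starts exit immediately because the lock is already held.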

Another approach to consider would be to submit the files via HTTP POST and then process them via a CGI. This way, you guarantee that they've been dealt with properly at the time of submission.
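
If the workstations can be pointed at a URL instead of an FTP directory, the receiving side can be very small. A rough sketch using Python's standard cgi module (the form field names and the archive directory are hypothetical, and a real version should validate the station name):

#!/usr/bin/env python
# CGI sketch: receive a POSTed CSV and append it to a per-workstation archive.
import cgi
import os

ARCHIVE_DIR = "/home/user/archives"              # hypothetical

form = cgi.FieldStorage()
station = os.path.basename(form.getfirst("station", "unknown"))
upload = form["data"]                            # the uploaded CSV form field

with open(os.path.join(ARCHIVE_DIR, station + ".csv"), "ab") as out:
    out.write(upload.file.read())

print("Content-Type: text/plain")
print("")
print("OK")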

bmdhacks
+4  A: 

Most modern Linux distributions support inotify, which lets your process know when the contents of a directory have changed, so you don't even need to poll.

Edit: With regard to the comment below from Mark Baker:

"Be careful though, as you'll be notified as soon as the file is created, not when it's closed. So you'll need some way to make sure you don't pick up partial files."

That will happen with the inotify watch set at the directory level. The way to make sure you don't pick up a partial file is to set a further inotify watch on the new file and look for the IN_CLOSE event, so that you know the file has been written completely.

Once your process has seen this, you can delete the inotify watch on this new file, and process it at your leisure.
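
With pyinotify you can get the same effect with a single watch, because a directory watch also reports IN_CLOSE_WRITE for files inside it, and that event only fires once a writer has closed the file. A minimal sketch (the watched path is hypothetical):

import pyinotify

class Handler(pyinotify.ProcessEvent):
    def process_IN_CLOSE_WRITE(self, event):
        # The file has been written and closed; safe to process now.
        print("completed upload:", event.pathname)

wm = pyinotify.WatchManager()
wm.add_watch("/home/user/ftp/workstations", pyinotify.IN_CLOSE_WRITE, rec=True)
pyinotify.Notifier(wm, Handler()).loop()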

GodEater
+1  A: 

The 30-minute limitation is pretty silly, really. Starting processes in Linux is not an expensive operation, so if all you're doing is checking for new files, there's no good reason not to do it more often than that. We have cron jobs that run every minute and they don't have any noticeable effect on performance. However, I realise it's not your rule, and if you're going to stick with that hosting provider you don't have a choice.

You'll need a long-running daemon of some kind. The easy way is just to poll regularly, and that's probably what I'd do. Inotify, which notifies you as soon as a file is created, is the better option.

You can use inotify from Perl with Linux::Inotify, or from Python with pyinotify.

Be careful though, as you'll be notified as soon as the file is created, not when it's closed. So you'll need some way to make sure you don't pick up partial files.

With polling it's less likely you'll see partial files, but it will happen eventually, and when it does it will be a nasty, hard-to-reproduce bug, so it's better to deal with the problem now.
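
A cheap way to guard against partial files when polling is to only treat a file as complete once its size has stopped changing (a sketch only; the wait below is an arbitrary choice):

import os, time

def is_stable(path, wait=5):
    # Consider the file complete only if its size is unchanged after `wait` seconds.
    size = os.path.getsize(path)
    time.sleep(wait)
    return os.path.getsize(path) == size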

Mark Baker
+1  A: 

If you're looking to stay with your existing FTP server setup, then I'd advise using something like inotify or a daemonized process to watch the upload directories. If you're OK with moving to a different FTP server, you might take a look at pyftpdlib, which is a Python FTP server library.

I've been part of the dev team for pyftpdlib for a while, and one of the more common requests was for a way to "process" files once they've finished uploading. Because of that we created an on_file_received() callback method that's triggered on completion of an upload (see issue #79 on our issue tracker for details).

If you're comfortable in Python, then it might work out well for you to run pyftpdlib as your FTP server and run your processing code from the callback method. Note that pyftpdlib is asynchronous and not multi-threaded, so your callback method must not block. If you need to run long-running tasks, I would recommend using a separate Python process or thread for the actual processing work.
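
A minimal sketch of that setup (the user, password, port, directory and permission string are all hypothetical, and the module layout matches recent pyftpdlib releases):

import threading

from pyftpdlib.authorizers import DummyAuthorizer
from pyftpdlib.handlers import FTPHandler
from pyftpdlib.servers import FTPServer

def process(path):
    # Placeholder: concatenate/archive the freshly uploaded CSV.
    pass

class CSVHandler(FTPHandler):
    def on_file_received(self, file):
        # Called once the upload has completed; hand the work to a thread
        # so the asynchronous loop is never blocked.
        threading.Thread(target=process, args=(file,)).start()

authorizer = DummyAuthorizer()
authorizer.add_user("station01", "secret", "/home/user/ftp/workstations", perm="elradfmw")

CSVHandler.authorizer = authorizer
server = FTPServer(("0.0.0.0", 2121), CSVHandler)
server.serve_forever()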

Jay