I'm writing an application that monitors a directory for new input files by polling the directory every few seconds. New files may often be several megabytes, and so take some time to fully arrive in the input directory (e.g. on copy from a remote share).

Is there a simple way to detect whether a file is currently in the process of being copied? Ideally any method would be platform and filesystem agnostic, but failing that specific strategies might be required for different platforms.

I've already considered taking two directory listings separated by a few seconds and comparing file sizes, but this introduces a time/reliability trade-off that my superiors aren't happy with unless there is no alternative.

For background, the application is being written as a set of Matlab M-files, so no JRE/CLR tricks I'm afraid...


Edit: files are arriving in the input directory by a straight move/copy operation, either from a network drive or from another location on a local filesystem. This copy operation will probably be initiated by a human user rather than another application.

As a result, it's pretty difficult to place any responsibility on the file provider to add control files or use an intermediate staging area...


Conclusion: it seems like there's no easy way to do this, so I've settled for a belt-and-braces approach - a file is ready for processing if:

  • its size doesn't change in a certain period of time, and
  • it's possible to open the file in read-only mode (some copying processes place a lock on the file).
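The two conditions above can be sketched roughly as follows. This is an illustration in Python rather than MATLAB, and the function name `is_ready` and the `settle_seconds` parameter are made up for the example; the right waiting period depends on how slow the slowest copy is likely to be.

```python
import os
import time


def is_ready(path, settle_seconds=5.0):
    """Return True if the file size is stable and the file opens read-only.

    Hypothetical helper: the name and the default wait are assumptions.
    """
    try:
        size_before = os.path.getsize(path)
        time.sleep(settle_seconds)
        size_after = os.path.getsize(path)
    except OSError:
        return False  # file vanished or is not accessible

    if size_before != size_after:
        return False  # still growing (or shrinking), so not ready yet

    try:
        # Some copying processes hold a lock that makes this open fail.
        with open(path, "rb"):
            pass
    except OSError:
        return False
    return True
```

Neither test is conclusive on its own: a stalled network copy can look stable for a while, and not every copying process locks the file.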

Thanks to everyone for their responses!

+6  A: 

The safest method is to have the application(s) that put files in the directory first put them in a different, temporary directory, and then move them to the real one (which should be an atomic operation even when using FTP or file shares). You could also use naming conventions to achieve the same result within one directory.
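A minimal sketch of the staging idea, in Python for illustration. The function name `deliver` and the directory arguments are hypothetical, and the sketch assumes the staging and input directories live on the same filesystem, since a rename is generally only atomic in that case.

```python
import os
import shutil


def deliver(source, staging_dir, input_dir):
    """Copy source into staging_dir, then rename it into input_dir.

    Hypothetical helper for the producer side of the hand-off.
    """
    name = os.path.basename(source)
    staged = os.path.join(staging_dir, name)
    final = os.path.join(input_dir, name)

    # The slow part happens where the watcher is not looking.
    shutil.copyfile(source, staged)

    # The watcher sees either no file or the complete file.
    # Assumption: both directories are on the same filesystem.
    os.replace(staged, final)
    return final
```

The naming-convention variant is the same pattern inside one directory: write under a temporary name the watcher ignores, then rename to the final name.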

Edit: It really depends on the filesystem, on whether its copy functionality even has the concept of a "completed file". I don't know the SMB protocol well, but if it has that concept, you could write an app that exposes an SMB interface (or patch Samba) and an API to get notified for completed file copies. Probably a lot of work though.

Michael Borgwardt
+1  A: 

One simple possibility would be to poll at a fairly large interval (2 to 5 minutes) and only acknowledge the new file the second time you see it.
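As a rough Python illustration of "acknowledge on the second sighting": the names `poll_once`, `pending` and `done` are invented for this sketch, and the caller is assumed to sleep for the polling interval between calls.

```python
import os


def poll_once(directory, pending, done):
    """Run one poll. Returns (ready, new_pending).

    pending: names seen for the first time on the previous poll.
    done: names already acknowledged; updated in place.
    """
    current = set(os.listdir(directory)) - done
    # A file is acknowledged only if it was already there last time.
    ready = sorted(current & pending)
    done.update(ready)
    return ready, current - set(ready)
```

A loop would call `ready, pending = poll_once(directory, pending, done)` once per interval, starting with two empty sets. The scheme is only as safe as the interval is long compared with the slowest copy.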

I don't know of a way in any OS to determine whether a file is still being copied, other than maybe checking if the file is locked.

Bork Blatt
+1  A: 

How are the files getting there? Can you set an attribute on them as they are written and then change the attribute when write is complete? This would need to be done by the thing doing the writing ... which sounds like it isn't an option.

Otherwise, caching the listing and treating a file as new if it has the same file size for two consecutive listings is the best way I can think of.

Alternatively, you could use the modified time on the file - the file has to be new and have a modified time that is at least x in the past. But I think this will be about equivalent to caching the listing.
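Both ideas, the cached listing and the modified-time threshold, might look something like this in Python. All three function names are hypothetical, and `min_age_seconds` stands in for the "x" above.

```python
import os
import time


def take_listing(directory):
    """Map each regular file name in directory to its size in bytes."""
    listing = {}
    for entry in os.scandir(directory):
        if entry.is_file():
            listing[entry.name] = entry.stat().st_size
    return listing


def unchanged_since(previous, current):
    """Names present in both listings with the same size."""
    return sorted(
        name for name, size in current.items() if previous.get(name) == size
    )


def older_than(path, min_age_seconds, now=None):
    """True if the modified time is at least min_age_seconds in the past."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(path) >= min_age_seconds
```

One caveat on the modified-time variant: a copy tool that preserves the source file's timestamp could make a file look old while it is still arriving, so it is worth checking how the copies in question behave.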

If you are polling the folder every few seconds, it's not much of a time penalty, is it? And it's platform agnostic.

Also, Linux only: http://www.linux.com/feature/144666

Like cron but for files. Not sure how it deals with your specific problem - but may be of use?

benlumley
+3  A: 

This is a middleware problem as old as the hills, and the short answer is: no.

The two 'solutions' put the onus on the file-uploader: (1) upload the file to a staging directory and then move it into the destination directory; (2) upload the file, and then create/upload a 'ready' file that indicates the state of the content file.
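For option (2), the consuming side could look roughly like this Python sketch. The `.ready` suffix and the function name are assumptions made for the example; any marker convention agreed with the uploader would do.

```python
import os

READY_SUFFIX = ".ready"  # hypothetical marker convention


def files_marked_ready(directory):
    """Return content files that have a companion '<name>.ready' marker."""
    names = set(os.listdir(directory))
    return sorted(
        name
        for name in names
        if not name.endswith(READY_SUFFIX) and name + READY_SUFFIX in names
    )
```

The uploader has to create the marker only after the content file is complete, which is exactly the responsibility this approach places on the file provider.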

The 1st one is the better, but both are inelegant. The truth is that better communication media exist than the filesystem. Consider using some IPC that involves only a push or a pull (and not both, as does the filesystem) such as an HTTP POST, a JMS or MSMQ queue, etc. Furthermore, this can also be synchronous, allowing the process receiving the file to acknowledge the content, even check it for worthiness, and hand the client a receipt - this is the righteous road to non-repudiation. Follow this, and you will never suffer arguments over whether a file was or was not delivered to your server for processing.

M.

Martin Cowie
Nearly a year after my initial post, I am increasingly close to penning a paper titled "FTP considered harmful". I see the "communication by big files over FTP" anti-pattern in so many so-called 'Enterprise' shops that it positively alarms me. I am certain it harks back to the era when communication between applications was effected by carrying a tape between machines, and that no-one has yet challenged the cries of "We've always done it this way".
Martin Cowie
A: 

What is your OS? On Unix you can use the "lsof" utility to determine if a user has the file open for write. Apparently the same functionality exists somewhere in the MS Windows Process Explorer.

Alternatively, you could just try an exclusive open on the file and bail out if this fails. But this can be a little unreliable and it's easy to tread on your own toes.
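A hedged Python sketch of the lsof approach. The function name is hypothetical, and the parsing assumes lsof's field output (`-F a`) reports the access mode as `ar`, `aw` or `au`. lsof usually only sees other users' processes when run with enough privileges, so a False result is not proof that nobody is writing.

```python
import shutil
import subprocess


def open_for_write_by_anyone(path, timeout=10.0):
    """Ask lsof whether any visible process has path open for writing.

    Returns True or False, or None if lsof is missing or could not run.
    """
    lsof = shutil.which("lsof")
    if lsof is None:
        return None
    try:
        result = subprocess.run(
            [lsof, "-F", "a", "--", path],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    # Assumption: with -F a, access-mode lines read "ar" (read),
    # "aw" (write) or "au" (read and write).
    return any(line in ("aw", "au") for line in result.stdout.splitlines())
```

Calling an external tool on every poll is comparatively slow, so it probably makes sense only as a final check on files that already look stable.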

James Anderson