ansaurus

Question

How to handle new files to process in cron job

Answer 1

A:

I see several issues.

If you have overlapping cron jobs, you need a locking mechanism to control access. Allow only one process at a time, which eliminates the overlap problem. You might set up a shell script to do that: create a 'lock' by making a directory (mkdir is atomic), process the data, then delete the lock directory. If the script finds that the directory already exists when it tries to create it, another copy is already running and it can just exit.
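The answer suggests a shell script; the same mkdir-based lock can be sketched in Python, the language of the processing script mentioned elsewhere in this thread. This is a minimal sketch, not a definitive implementation: the names directory_lock and run_once and the lock path are hypothetical.

```python
import os
from contextlib import contextmanager


@contextmanager
def directory_lock(lock_dir):
    """Yield True if this process took the lock, False if another copy holds it."""
    try:
        os.mkdir(lock_dir)  # atomic: raises if the directory already exists
    except FileExistsError:
        yield False
        return
    try:
        yield True
    finally:
        os.rmdir(lock_dir)  # release the lock even if processing raised


def run_once(lock_dir, process):
    """Run process() only when no other copy holds the lock (hypothetical helper)."""
    with directory_lock(lock_dir) as acquired:
        if not acquired:
            return False  # another copy is running, so just exit
        process()
        return True
```

One caveat of this approach: if the process is killed before it reaches the cleanup step, the lock directory stays behind and has to be removed by hand.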

If you can't change the cron table(s) then just rename the executable and name your shell script the same as the old executable.

Hashes are not guaranteed to be unique identifiers for files. Collisions are very unlikely, but uniqueness is not absolutely guaranteed.

Jay 2010-01-07 19:11:53

Overlapping cron jobs are one of my problems, but not the only one. I am actually using http://unixwiz.net/tools/lockrun.html as a solution for that, but the job takes a lot longer to run than needed because it reprocesses files that have already been handled. If I have to live with this, so be it, but I was trying to find a better way to do this with a hash or something else.

salparadise 2010-01-07 19:17:16

Answer 2

+2 A:

I don't know enough about what is in these files, so this may not work for you, but if you have only one intended consumer, I would recommend using directories and moving the files to reflect their state. Specifically, you could have a dir structure like

/waiting
/progress
/done

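and use the relative atomicity of mv to change the "state" of each file. (As far as I know, mv is atomic only when source and destination are on the same local filesystem, where it is a single rename; across filesystems it falls back to a copy followed by a delete.)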

When your processing task wants to work on a file, it moves it from waiting to progress (and makes sure that the move succeeded). That way, no other task can pick it up, since it's no longer waiting. When the file is complete, it gets moved from progress to done, where a cleanup task might delete or archive old files that are no longer needed.
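The waiting/progress/done flow can be sketched in Python as follows. This is only an illustration under assumptions: the function names claim and process_waiting and the handle callback are hypothetical, and all three directories are assumed to live on the same filesystem so that os.rename is a single rename.

```python
import os


def claim(name, waiting_dir, progress_dir):
    """Move one file from waiting to progress.

    Returns the new path, or None if another task already took the file.
    """
    src = os.path.join(waiting_dir, name)
    dst = os.path.join(progress_dir, name)
    try:
        os.rename(src, dst)
    except FileNotFoundError:
        return None  # the move failed: someone else claimed it first
    return dst


def process_waiting(waiting_dir, progress_dir, done_dir, handle):
    """Claim each waiting file, run handle(path) on it, then move it to done."""
    finished = []
    for name in sorted(os.listdir(waiting_dir)):
        path = claim(name, waiting_dir, progress_dir)
        if path is None:
            continue
        handle(path)  # if this raises, the file stays in progress
        os.rename(path, os.path.join(done_dir, name))
        finished.append(name)
    return finished
```

A file left behind in the progress directory then marks an interrupted run, which a cleanup task could detect and retry.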

MikeSep 2010-01-07 19:25:55

Sounds promising. I will do some testing with this concept. I will update with what I come up with. Thanks.

salparadise 2010-01-07 20:26:51

Answer 3

A:

Why not just move a processed file to a different directory?

You mentioned overlapping cron jobs. Does this mean one conversion process can start before the previous one has finished? If so, you would perform the move at the beginning of the conversion. If you are worried about an interrupted conversion, use an intermediate directory, and move the file to a final directory after completion.

gary 2010-01-07 19:33:08

OK - we were all responding at the same time...

gary 2010-01-07 19:38:09

Answer 4

A:

If I'm reading the code correctly, you're updating the database (by which I mean the log of files processed) at the very end. So when a huge file is still being processed, another cron job can 'legally' start working on it, and both complete successfully, resulting in two entries in the database.

I suggest you move the logging-to-database step up to the start, where it would act as a lock against subsequent cron jobs, and record a 'success' or 'completed' state at the very end. The latter part is important, because an entry that is shown as processing but has no completed state can (coupled with the notion of time) be programmatically flagged as an error. (That is to say, a cron job tried processing it but never completed it, and the log has shown processing for 1 week!)

To summarize

Move up the log-to-database so that it would act as a lock
Add a 'success' or 'completed' state which would give the notion of errored state
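The two points above can be sketched like this. The thread does not say which database the asker uses, so this sketch assumes SQLite via Python's sqlite3 module; the table name processed, its columns, and all function names are hypothetical.

```python
import sqlite3
import time


def open_log(path):
    """Open the processing log; filename is the primary key, so it acts as the lock."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed ("
        " filename TEXT PRIMARY KEY,"
        " status TEXT NOT NULL,"
        " started REAL NOT NULL)"
    )
    conn.commit()
    return conn


def try_claim(conn, filename, now=None):
    """Log the file as 'processing' before any work; False means it is already logged."""
    now = time.time() if now is None else now
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO processed VALUES (?, 'processing', ?)",
                (filename, now),
            )
    except sqlite3.IntegrityError:
        return False
    return True


def mark_completed(conn, filename):
    """Record the 'completed' state at the very end of processing."""
    with conn:
        conn.execute(
            "UPDATE processed SET status = 'completed' WHERE filename = ?",
            (filename,),
        )


def stale(conn, max_age, now=None):
    """List files stuck in 'processing' for longer than max_age seconds."""
    now = time.time() if now is None else now
    rows = conn.execute(
        "SELECT filename FROM processed"
        " WHERE status = 'processing' AND started < ? ORDER BY filename",
        (now - max_age,),
    )
    return [row[0] for row in rows]
```

A cron job would call try_claim before touching a file and skip it on False; anything returned by stale can be treated as an errored run.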

PS: Don't take this the wrong way, but the code is a little hard to understand. I am not sure I understand it at all.

jeffjose 2010-01-07 20:34:06

At least it is just a "little hard to understand"; you would have cried looking at my Perl scripts.

salparadise 2010-01-07 23:43:23

Answer 5

+2 A:

A good way to handle/process files that are created at random times is to use incron rather than cron. (Note: since incron uses the Linux kernel's inotify syscalls, this solution only works with Linux.)

Whereas cron runs a job based on dates and times, incron runs a job based on changes in a monitored directory. For example, you can configure incron to run a job every time a new file is created or modified.

On Ubuntu, the package is called incron. I'm not sure about RedHat, but I believe this is the right package: http://rpmfind.net//linux/RPM/dag/redhat/el5/i386/incron-0.5.9-1.el5.rf.i386.html.

Once you install the incron package, read

man 5 incrontab

for information on how to set up the incrontab config file. Your incron_config file might look something like this:

/var/ss01/ IN_CLOSE_WRITE /path/to/processing/script.py $#
/var/ss02/ IN_CLOSE_WRITE /path/to/processing/script.py $#
/var/ss03/ IN_CLOSE_WRITE /path/to/processing/script.py $#
/var/ss04/ IN_CLOSE_WRITE /path/to/processing/script.py $#

Then to register this config with the incrond daemon, you'd run

incrontab /path/to/incron_config

That's all there is to it. Now whenever a file in /var/ss01, /var/ss02, /var/ss03 or /var/ss04 is closed after being written, the command

/path/to/processing/script.py $#

is run, with $# replaced by the name of the newly created file.

This will obviate the need to store/compare hashes, and files will only get processed once -- immediately after they are created.

Just make sure your processing script does not write into the top level of the monitored directories. If it does, then incrond will notice the new file created, and launch script.py again, sending you into an infinite loop.

incrond monitors individual directories, and does not recursively monitor subdirectories. So you could direct tshark to write to /var/ss01/tobeprocessed, use incron to monitor /var/ss01/tobeprocessed, and have your script.py write to /var/ss01, for example.
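A minimal sketch of a script.py that follows this layout is below. It rests on assumptions that differ from the config shown earlier: the incrontab line is assumed to pass both the watched directory and the file name (`$@ $#`) as two arguments, the `.out` suffix is made up, and the byte-for-byte copy is only a placeholder for the real conversion.

```python
import os
import sys


def output_path_for(watched_dir, filename, suffix=".out"):
    """Put results in the parent of the watched directory, which incron does not monitor."""
    parent = os.path.dirname(os.path.normpath(watched_dir))
    return os.path.join(parent, filename + suffix)


def main(argv):
    """Expects argv = [script, watched_dir, filename], as passed by `$@ $#` (assumed)."""
    if len(argv) != 3:
        print("usage: script.py WATCHED_DIR FILENAME", file=sys.stderr)
        return 2
    watched_dir, filename = argv[1], argv[2]
    src = os.path.join(watched_dir, filename)
    dst = output_path_for(watched_dir, filename)
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        fout.write(fin.read())  # placeholder for the real conversion step
    return 0
```

To use it as a script, end the file with a call such as sys.exit(main(sys.argv)). Because the output lands one level above the watched directory, writing it does not trigger another event.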

PS. There is also a Python interface to inotify, called pyinotify. Unlike incron, pyinotify can recursively monitor subdirectories. However, in your case, I don't think the recursive monitoring feature is useful or necessary.

unutbu 2010-01-07 21:13:17

Actually, this sounds like the most promising method. So if tshark creates the file "file1" and takes an hour to finish writing to it while it runs, is incron going to detect that it is a modified file and needs to be processed? Thanks.

salparadise 2010-01-07 22:57:58

Right, that is how incron should work. By the way, you may have to add your username to /etc/incron.allow before running `incrontab`...

unutbu 2010-01-07 23:54:08

Sorry, I think I made a mistake in my original post. I edited my post to use IN_CLOSE_WRITE instead of IN_CREATE, since IN_CLOSE_WRITE is the event that fires after the file is closed.

unutbu 2010-01-08 00:28:43

cool. This looks great, I will test and update.

salparadise 2010-01-08 01:03:48

so before I get started on this, do you know if it supports NFS? my google-fu is failing.

salparadise 2010-01-09 01:10:21

I don't have first-hand experience, but according to http://beagle-project.org/FAQ, inotify (and presumably therefore incron) will work only if the changes to the NFS directory are made locally (on the Linux machine). Remote changes will not trigger an inotify event.

unutbu 2010-01-09 02:26:00
