If I have a multiprocess system that needs to process a bunch of directories, 1 directory per process, how likely is it that two processes will happen to grab the same directory?

Say I have dir/1 all the way to dir/99. I figure that if I touch a .claimed file in the dir that the process is working on, there won't be conflicts. Are there problems with my approach?


There's a bit more complexity. It's not only multi-process, but it's distributed across several computers.

A: 

If you are worried about collisions, then I would have a master process that delegates the directories out to the processes. Another option that I've used before is to list all of your directories in a database table. Then you can use the database's built-in concurrency features to pull out records and mark them as locked.
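As a sketch of the database idea — assuming a SQLite file on a path every worker can reach (a client/server database would be the safer choice across several machines) — the atomic "pull a record and mark it locked" step could look like this:

```python
import sqlite3

def init_queue(db_path, dirs):
    # One row per directory; claimed = 0 means still up for grabs.
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS work"
                " (dir TEXT PRIMARY KEY, claimed INTEGER NOT NULL DEFAULT 0)")
    con.executemany("INSERT OR IGNORE INTO work (dir) VALUES (?)",
                    [(d,) for d in dirs])
    con.commit()
    con.close()

def claim_next(db_path):
    """Atomically pull one unclaimed directory, or None if all are taken."""
    con = sqlite3.connect(db_path, isolation_level=None)
    try:
        con.execute("BEGIN IMMEDIATE")  # take the write lock before reading
        row = con.execute(
            "SELECT dir FROM work WHERE claimed = 0 LIMIT 1").fetchone()
        if row is not None:
            con.execute("UPDATE work SET claimed = 1 WHERE dir = ?", (row[0],))
        con.execute("COMMIT")
        return row[0] if row else None
    finally:
        con.close()
```

`BEGIN IMMEDIATE` makes the select-then-update a single critical section, so two workers calling `claim_next` at the same time can never be handed the same directory.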

David
If the chance of collisions is very minimal, then I wouldn't care.
Blaine LaFreniere
Are you sure you wouldn't care? The chance of a collision depends on what you are doing and how. Since the described implementation has a race condition in it, the likelihood of a problem depends on how long directories take to process, how many there are, and the variability in processing time.
Spudd86
A: 

I do not know how your application works, but if it processes the folders recursively from a root folder, it is very likely you will double-process your files.

Here are some options

Option 1

If you have full control of the application, you can modify it to read in a list of folders from a configuration file:

myprogram.exe file1.config

myprogram.exe file2.config

where file1.config contains the names of directories 1-50 and file2.config contains the names of directories 51-100.
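As an illustration of that split (the file names and `dir/N` naming are just placeholders matching the example above), a tiny generator script could write the two config files, one directory per line:

```python
# Split dir/1 .. dir/100 between two config files, one per process.
dirs = [f"dir/{i}" for i in range(1, 101)]
half = len(dirs) // 2
for name, chunk in (("file1.config", dirs[:half]),
                    ("file2.config", dirs[half:])):
    with open(name, "w") as f:
        f.write("\n".join(chunk) + "\n")
```

Because the two files are disjoint by construction, the two processes can never touch the same directory.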

Option 2

Use the for loop in your OS's shell to specify explicitly which folders your program should process. (Note: I have specified DOS command syntax; please adapt it to your OS.)

for %f in (dir1, dir2, dir3, dir4) do start myprogram.exe %f
for %f in (dir11, dir12, dir13, dir14) do start myprogram.exe %f
Syd
This would process each directory sequentially, by invoking the program multiple times; it won't process the directories concurrently via multiple simultaneous processes, which is what the OP is asking about.
Stephen P
Sorry, let me rephrase: you invoke the two command lines simultaneously.
Syd
A: 

If the number of worker threads and the number of directories is known, you can divide the range of directories between the processes and thus avoid collisions.

So e.g. process 1 knows to take care of dir/1 to dir/10.

dantje
+1  A: 

I recall something about directory creation being atomic, but not file creation, so your .claimed ought to be a directory; however, I don't recall which OS that applied to.
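On POSIX systems mkdir does fail atomically when the directory already exists, so the .claimed-as-directory idea can be sketched like this (worker loop and cleanup omitted):

```python
import os

def try_claim(work_dir):
    """Claim work_dir by creating a .claimed subdirectory inside it.
    mkdir either creates the directory or raises, so at most one
    process wins even when several race on the same work_dir."""
    try:
        os.mkdir(os.path.join(work_dir, ".claimed"))
        return True
    except FileExistsError:
        return False
```

One caveat for the distributed case: atomicity guarantees can be weaker on network filesystems such as NFS, so this is worth verifying on whatever shared filesystem the machines actually use.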

I'd take a different approach: list all the directories you want to process, writing the output to a pipe, which acts as a work queue that each process will read from. IIRC system pipe semantics (named or anonymous) are that reading from a pipe is an atomic operation: two processes will not be able to read the same data.

A master process could write the list to a pipe and spawn the worker processes, or the worker processes could just block trying to read until you manually write the list to the pipe.

Stephen P
One pipe per worker would be a better idea, since it'd be hard to make sure that you read the directory names in whole chunks.
Spudd86
awesome, and this page appears to confirm that making a directory is atomic: http://rcrowley.org/2010/01/06/things-unix-can-do-atomically.html
Blaine LaFreniere
@Blaine - thanks for the link, that's a good page.
Stephen P