I've got multiple Python processes (typically one per core) transforming large volumes of data. Each process reads from its own dedicated source, and all of them write to a single output file that each opened in append mode.

Is this a safe way for these programs to work?

Because of the tight performance requirements and large data volumes, I don't think I can have each process repeatedly open and close the file. Another option is to have each process write to a dedicated output file and have a single process concatenate them once they're all done, but I'd prefer to avoid that.

Thanks in advance for any & all answers and suggestions.

+3  A: 

Have you considered using the multiprocessing module to coordinate between the running programs in a thread-like manner? See in particular its queue interface: each worker can place a completed work item on a shared queue, and a single process can read items off the queue and write them to your output file.
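For concreteness, here is a minimal sketch of that single-writer pattern. The worker body, the None sentinel, the queue bound, and the output file name are all illustrative assumptions, not anything taken from the question:

    import multiprocessing as mp

    def worker(source_id, out_queue):
        # Stand-in for the real transform; reads from this worker's
        # dedicated source and emits records (stubbed with a loop here).
        for i in range(1000):
            out_queue.put("source %d: record %d\n" % (source_id, i))
        out_queue.put(None)  # sentinel: this worker is finished

    def writer(out_queue, n_workers, path):
        # The only process that touches the output file, so records
        # from different workers can never interleave mid-record.
        done = 0
        with open(path, "a") as f:
            while done < n_workers:
                item = out_queue.get()
                if item is None:
                    done += 1
                else:
                    f.write(item)

    if __name__ == "__main__":
        n_workers = mp.cpu_count()
        q = mp.Queue(maxsize=10000)  # bounded, to apply backpressure
        w = mp.Process(target=writer, args=(q, n_workers, "output.txt"))
        w.start()
        workers = [mp.Process(target=worker, args=(i, q))
                   for i in range(n_workers)]
        for p in workers:
            p.start()
        for p in workers:
            p.join()
        w.join()

Because only the writer process ever touches the file, the bounded queue also throttles the workers if they outrun the disk.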

Alternatively, you can have each subprocess maintain a separate pipe to a parent process, which does a select() call across all of them and copies data to the output file when appropriate. Of course, this can be done "by hand" (without the multiprocessing module) as well as with it.
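A hedged sketch of that pipe-per-worker variant follows. It uses multiprocessing.Pipe together with multiprocessing.connection.wait(), which plays the role of select() here; the worker body, record count, and file name are made-up placeholders:

    import multiprocessing as mp
    from multiprocessing.connection import wait

    def worker(conn, source_id):
        # Stand-in for the real transform; each record is one send().
        for i in range(1000):
            conn.send("source %d: record %d\n" % (source_id, i))
        conn.close()  # closing the pipe signals EOF to the parent

    if __name__ == "__main__":
        readers, procs = [], []
        for i in range(mp.cpu_count()):
            r, s = mp.Pipe(duplex=False)
            p = mp.Process(target=worker, args=(s, i))
            p.start()
            s.close()  # parent keeps only the read end
            readers.append(r)
            procs.append(p)

        with open("output.txt", "a") as f:
            while readers:
                for r in wait(readers):  # blocks until a pipe is readable
                    try:
                        f.write(r.recv())
                    except EOFError:
                        readers.remove(r)
        for p in procs:
            p.join()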

Alternatively, if the reason you're avoiding threads is to avoid the global interpreter lock, you might consider a non-CPython implementation (such as Jython or IronPython).

Charles Duffy
Assuming 8 processes, each writing 250,000 100-byte records to a single file, and that ordering does not matter - what is the simplest way of using a named pipe?
KenFar
@KenFar - the processes each have a named pipe to a single process which does nothing but read from those pipes (a record at a time, using the select call to find out which pipes are immediately readable) and write to the single output file. The _simplest_ way of using pipes is to use the multiprocessing module and let it set the pipes up for you automatically. (Using _named_ pipes is probably unnecessary complexity; if all these processes are started by a single parent, there's no point to it).
Charles Duffy
+3  A: 

Your procedure is "safe" in that no crashes will result, but data coming from different processes can, with very unlucky timing, get mixed up -- e.g., if process 1 is appending a long string of a's and process 2 a long string of b's, you could end up in the file with lots of a's, then b's, then more a's (or other combinations/mixings).

The problem is that .write is not guaranteed to be atomic for sufficiently long string arguments. If you keep a tight bound on each write -- less than your OS/filesystem's block size -- you might be lucky. Otherwise, try the logging module, which does take more precautions (though those precautions might slow you down; you'll need to benchmark) exactly because it targets "log files" that are often appended to by multiple programs.
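As a rough illustration of the logging route, here's a minimal sketch in which each process builds its own handler on the shared file; the logger name, format, and file name are assumptions. Note that in CPython each log record goes out as a single write() call, so this helps only while records stay below the size the OS will append atomically:

    import logging

    def make_record_logger(path):
        # Called once in each worker process; every process gets its
        # own handler opened on the shared file in append mode.
        logger = logging.getLogger("records")  # hypothetical name
        handler = logging.FileHandler(path, mode="a")
        handler.setFormatter(logging.Formatter("%(message)s"))
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
        return logger

    log = make_record_logger("output.txt")
    log.info("one output record")  # emitted as a single write() call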

Alex Martelli