My problem is quite simple: I have a 400MB file containing 10,000,000 lines of data. I need to iterate over each line, do something with it, and then drop the line from memory so that I don't fill up too much RAM.
Since my machine has several processors, my initial idea for speeding this up was to create two separate processes. One would read the file several lines at a time and gradually fill a list (each element of the list being one line of the file). The other would have access to this same list, pop() elements off it, and process them. This would effectively give a list that grows from one end and shrinks from the other.
In other words, this mechanism would implement a buffer that is constantly populated with lines for the second process to crunch. But maybe this is no faster than simply using:
for line in open('/data/workfile', 'r'):
    pass  # do something with the line
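To make the idea concrete, here is a rough sketch of what I had in mind, except using a multiprocessing.Queue as the shared buffer instead of a plain list (as far as I know, two processes cannot share an ordinary list directly); read_lines and handle_lines are just placeholder names, not code I actually have:

import multiprocessing as mp

def read_lines(path, queue):
    # Producer: push each line of the file onto the shared queue.
    with open(path, 'r') as f:
        for line in f:
            queue.put(line)
    queue.put(None)  # sentinel telling the consumer there is nothing left

def handle_lines(queue):
    # Consumer: pop lines off the queue and process them one by one.
    while True:
        line = queue.get()
        if line is None:
            break
        # do something with the line here

if __name__ == '__main__':
    # maxsize bounds the buffer so the reader cannot get too far ahead of the worker
    q = mp.Queue(maxsize=10000)
    producer = mp.Process(target=read_lines, args=('/data/workfile', q))
    consumer = mp.Process(target=handle_lines, args=(q,))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()

The maxsize argument is what would keep the buffer from growing without bound if the reader outpaces the worker. Is this kind of setup actually worth it here, or is the plain for loop just as fast?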