My problem is quite simple: I have a 400MB file containing 10,000,000 lines of data. I need to iterate over each line, do something with it, and then drop the line from memory to avoid using too much RAM.

Since my machine has several processors, my initial idea for optimizing this was to use two different processes. One would read the file several lines at a time and gradually fill a list (one element of the list being one line of the file). The other would have access to this same list and would pop() elements off it and process them. This would effectively create a list that grows from one side and shrinks from the other.

In other words, this mechanism would implement a buffer constantly populated with lines for the second process to crunch. But maybe this is no faster than simply using:

for line in open('/data/workfile', 'r'):
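Spelled out, that would be something like the following minimal sketch (process_line() is just a stand-in for whatever the per-line work actually is):

# The file object yields one line at a time, so memory use stays small.
def process_line(line):
    pass  # stand-in for the real per-line work

with open('/data/workfile', 'r') as workfile:
    for line in workfile:
        process_line(line)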
+2  A: 

You're probably limited by the speed of your disk. Python already buffers file I/O, so reading line by line is efficient.

Douglas Leeder
+4  A: 

Your proposed for line in open('/data/workfile', 'r'): iterates over the file lazily (the file object is its own iterator), so the entire file will not be read into memory. I'd go with that until it actually turns out to be too slow.

Will McCutchen
I agree. The proposal would probably hurt performance by (1) polluting the processor's data cache, and (2) thrashing the Python global interpreter lock.
Daniel Newby
A: 

The data structure you want is a Queue (it has the proper blocking mechanisms, for example for concurrent writes), which is available in the multiprocessing module.
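A minimal sketch of that producer/consumer setup, assuming a hypothetical process_line() function (the queue size and the None sentinel are just illustrative choices):

from multiprocessing import Process, Queue

def process_line(line):
    pass  # stand-in for the real per-line work

def reader(path, queue):
    # Producer: read lines and push them onto the bounded queue.
    with open(path, 'r') as f:
        for line in f:
            queue.put(line)
    queue.put(None)  # sentinel: tell the consumer we're done

def worker(queue):
    # Consumer: pop lines off the queue until the sentinel arrives.
    while True:
        line = queue.get()
        if line is None:
            break
        process_line(line)

if __name__ == '__main__':
    q = Queue(maxsize=1000)  # bounded, so the reader can't outrun the worker
    p = Process(target=worker, args=(q,))
    p.start()
    reader('/data/workfile', q)
    p.join()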

If there is no dependency between the processing of your lines, you could map the line-by-line generator onto a pool of processes using the functions in that module, making the whole thing use multiple cores in a few lines.
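For example, something along these lines (process_line() is again a placeholder, and the chunksize is only a guess):

from multiprocessing import Pool

def process_line(line):
    return len(line)  # stand-in for the real per-line work

if __name__ == '__main__':
    pool = Pool()  # one worker per core by default
    with open('/data/workfile', 'r') as f:
        # imap consumes the file lazily instead of building a 10M-element list
        for result in pool.imap(process_line, f, chunksize=1000):
            pass  # collect or aggregate results here
    pool.close()
    pool.join()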

See also MapReduce approaches (though that might be a little overkill here).

makapuf