ansaurus

Question

read multiple files using multiprocessing

Answer 1

A:

Multiprocessing is more suited to CPU- or memory-oriented processes since the seek time of rotational drives kills performance when switching between files. Either load your log files into a fast flash drive or some sort of memory disk (physical or virtual), or give up on multiprocessing.

Ignacio Vazquez-Abrams 2010-01-15 02:29:59

my problem is cpu-bounded and not IO-bounded. In this example I'm splitting lines, but in the real case I'm working with a complex and long regex and the IO time (seek, ...) is much minor then the cpu time

wiso 2010-01-15 09:12:11

Answer 2

A:

You're creating a pool with as many workers as files. That may be too many. Usually, I aim to have the number of workers around the same as the number of cores.

The simple fact is that your final step is going to be a single process merging all the results together. There is no avoiding this, given your problem description. This is known as a barrier synchronization: all tasks have to reach the same point before any can proceed.

You should probably run this program multiple times, or in a loop, passing a different value to multiprocessing.Pool() each time, starting at 1 and going to the number of cores. Time each run, and see which worker count does best.

The result will depend on how CPU-intensive (as opposed to disk-intensive) your task is. I would not be surprised if 2 were best if your task is about half CPU and half disk, even on an 8-core machine.

Mike DeSimone 2010-01-15 02:40:58

Yes I already did it. My choice was not a random choice. I tried to measure time without the return line, and the best choice was when the number of processes is equal to the number of file even if the number of process is greater than the number of cores.

wiso 2010-01-15 09:21:46

Then I don't see how you can do better. The killer is: "It is important that the order of the lines is preserved." This can only be done one input at a time, even though you preprocessed each file independently. Your alternative would be to have each worker generate a file with a suffix, and have whatever reads these files read them in order, so the merge gets eliminated.

Mike DeSimone 2010-01-15 12:25:39

Answer 3

+2 A:

You're probably hitting two problems.

One of them was mentioned: you're reading multiple files at once. Those reads will end up being interleaved, causing disk thrashing. You want to read whole files at once, and then only multithread the computation on the data.

Second, you're hitting the overhead of Python's multiprocessing module. It's not actually using threads, but instead starting multiple processes and serializing the results through a pipe. That's very slow for bulk data--in fact, it seems to be slower than the work you're doing in the thread (at least in the example). This is the real-world problem caused by the GIL.

If I modify do() to return None instead of container.items() to disable the extra data copy, this example is faster than a single thread, as long as the files are already cached:

Two threads: 0.36elapsed 168%CPU

One thread (replace pool.map with map): 0:00.52elapsed 98%CPU

Unfortunately, the GIL problem is fundamental and can't be worked around from inside Python.

Glenn Maynard 2010-01-15 04:36:36

Yes, this is the problem: return data. I'm using multiprocessing and not multithreading because of GIL. But I want to optimize my program using all the core of my cpu! If I mesure the time from "Start reading file" and "Readeng %d lines" (ignoring the return time) the multiprocessing version is 2 times faster than the single process version (I've 2 core). Now: what about shared memory? I looked at the multiprocess.Manager class, but I want to share a structure more complex than dict.

wiso 2010-01-15 09:18:48

I havn't used Manager, but it looks like it proxies manipulation of data, so I suspect it's even slower. You can use shared memory to share simple blocks of memory, but not native Python types. You might want to look for other optimizations, but without any actual code I can't make any suggestions.

Glenn Maynard 2010-01-15 11:40:27

maybe a solution can be: rewrite the read function in c++ and use real multithreading with c++? With this approach can I bypass the problem to share data between processes (pipe)?

wiso 2010-01-15 20:04:00

You'll always need to hold the GIL while you construct the Python data structures, and it's a lot more work (on your part) to do your parsing and constructing the results in C than in Python. I can't say whether this would be a good idea or not, but it sounds like a mess.

Glenn Maynard 2010-01-16 01:31:14

ansaurus

tags:

views:

answers:

read multiple files using multiprocessing

related questions