I am attempting to use Python to gain some performance on a task that can be highly parallelized using http://docs.python.org/library/multiprocessing.

Looking at the library documentation, it says to use a chunksize for very long iterables. Now, my iterable is not long, but one of the dicts it contains is huge: ~100,000 entries, with tuples as keys and numpy arrays as values.

How would I set the chunksize to handle this, and how can I transfer this data quickly?

Thank you.
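(For context, the chunksize in question is the optional third argument to Pool.map() and Pool.map_async(); it controls how many items from the iterable are bundled into each task sent to a worker. A minimal sketch, where work() is a placeholder for the real per-item computation:

    from multiprocessing import Pool

    def work(item):
        return item * item  # placeholder for the real per-item task

    if __name__ == '__main__':
        pool = Pool(processes=4)
        # Each batch of 50 items is pickled and dispatched to a worker
        # as a single task, reducing per-task IPC overhead.
        results = pool.map(work, range(1000), chunksize=50)
        pool.close()
        pool.join()
)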

+1  A: 

The only way to handle this single large item in multiple workers at once is by splitting it up. multiprocessing works by dividing the work into units, but the smallest unit you can feed it is one object -- it can't know how to split up a single object in a way that's sensible. You have to do that yourself: instead of sending over the dicts to be worked on, split the dicts into smaller work units and send those over instead. If you can't split the dict because all the data is interdependent, then you can't really split up the work either.

Thomas Wouters
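A rough sketch of that splitting, assuming the per-entry work is independent (the stand-in data and the sum() inside process_chunk() are placeholders for the real numpy arrays and computation):

    from multiprocessing import Pool

    def process_chunk(chunk):
        # 'chunk' is one small dict of key -> value: do the real work here
        # and return results for just these keys, so only a small piece
        # is pickled back to the parent.
        return dict((key, sum(values)) for key, values in chunk.items())

    def split_dict(d, n_chunks):
        # Slice one big dict into n_chunks smaller dicts (the work units).
        items = list(d.items())
        step = max(1, (len(items) + n_chunks - 1) // n_chunks)
        return [dict(items[i:i + step]) for i in range(0, len(items), step)]

    if __name__ == '__main__':
        huge = dict(((i, i + 1), [i, i, i]) for i in range(1000))  # stand-in
        pool = Pool(processes=4)
        partial = pool.map(process_chunk, split_dict(huge, 16))
        pool.close()
        pool.join()
        merged = {}
        for part in partial:
            merged.update(part)

Each worker only ever sees one small chunk, and the parent merges the partial results at the end.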
Ah ok, makes sense. Currently the workers each grab the giant dict, make a copy of it, modify it, and then send back their version (not exactly lightweight). Since you seem to be the guy who knows his Python multiprocessing: if the giant dict were read-only, is there a way to let all of the workers access its data as needed, efficiently? (This would be easy with threads, but with multiprocessing it gets tricky fast, it seems.)
Sandro
If you're not on Windows, and you make this 'read-only' dict part of the process *before* you spawn the workers, and store it in (for example) a global or an enclosed local, all of the workers can access it without suffering the serialization cost.
Thomas Wouters
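A minimal sketch of that pattern, assuming a POSIX platform where the Pool forks its workers (the dict contents and the max() computation are placeholders):

    from multiprocessing import Pool

    # Build the read-only dict at module level, BEFORE the Pool is
    # created: each forked worker inherits it for free instead of
    # receiving it through pickling.
    GIANT = dict(((i, i + 1), [float(i)] * 4) for i in range(100000))

    def worker(key):
        # Read-only access to the inherited global; only the key is
        # sent to the worker, and only the result travels back.
        return key, max(GIANT[key])

    if __name__ == '__main__':
        pool = Pool(processes=4)
        results = dict(pool.map(worker, list(GIANT)))
        pool.close()
        pool.join()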
Uh oh, I'm just realizing now that I've been using the wrong terminology: I'm actually using the Pool.map_async() function to do all of this. Am I right to assume that with map there is no solution other than by forking? Is there a serious cost to joining the results back together?
Sandro
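For what it's worth, the fork trick above appears independent of which Pool method is called: the workers are forked when the Pool object is constructed, so a global set up beforehand should be visible to map_async() workers too, and the main cost of collecting results is pickling the return values back to the parent. A sketch under the same POSIX-fork assumptions (the data and sum() are placeholders):

    from multiprocessing import Pool

    # Stand-in for the real read-only dict, built before the Pool forks.
    SHARED = dict(((i, i + 1), [i] * 3) for i in range(10))

    def worker(key):
        # Return only a small result; return values are pickled back to
        # the parent, so keep them lean rather than sending back a whole
        # modified copy of SHARED.
        return key, sum(SHARED[key])

    if __name__ == '__main__':
        pool = Pool()
        async_result = pool.map_async(worker, list(SHARED))
        results = dict(async_result.get())  # blocks until all chunks finish
        pool.close()
        pool.join()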