views: 85
answers: 3

I have a CPU-intensive program that uses a large dictionary as its data (around 250M of data). I have a multicore processor and want to utilize it so that I can run more than one task at a time. The dictionary is mostly read-only and may be updated once a day.
How can I write this in Python without duplicating the dictionary?
I understand that, because of the GIL, Python threads will not offer true concurrency for CPU-bound code. Can I use the multiprocessing module without the data being serialized between processes?

I come from the Java world, and my requirement would be something like Java threads, which can share data, run on multiple processors, and offer synchronization primitives.

+1  A: 

Use shelve for the dictionary. Since writes are infrequent, there shouldn't be an issue with sharing it.
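
A minimal sketch of this approach (the filename `data_store` and the sample entries are made up for illustration): the daily update job writes the shelf once, and any number of worker processes open it read-only, pulling values off disk on demand instead of holding the whole dict in each process.

```python
import shelve

# Build the shelf once, e.g. in the daily update job. Keys must be strings.
with shelve.open("data_store") as db:
    db["alpha"] = {"value": 1}
    db["beta"] = {"value": 2}

# Any number of worker processes can open the same shelf read-only.
# Each lookup unpickles the stored value on access.
with shelve.open("data_store", flag="r") as db:
    print(db["alpha"]["value"])  # prints 1
```

Note that every read goes through pickle, which is the overhead the comment below is concerned about.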

Ignacio Vazquez-Abrams
From the docs it seems that a value fetch will return a copy of the data via pickle? A task for me typically accesses 1/3 of the dict, which would mean lots of temporary objects with this approach.
shibin
Then there's no good solution in CPython for your issue. Only a series of not-quite-awful ones, all involving some sort of database.
Ignacio Vazquez-Abrams
+1  A: 

You can share read-only data among processes simply with a fork (on Unix; there is no easy way on Windows), but that won't catch the "once a day" change: you'd need to put an explicit mechanism in place for each process to update its own copy. Native Python structures like dict are just not designed to live at arbitrary addresses in shared memory (you'd have to code a dict variant supporting that in C), so they offer no solace.
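
A sketch of the fork-based approach, assuming Unix and using `multiprocessing` with the "fork" start method (the dict contents here are placeholders for the real 250M structure): the dict is built in the parent before the pool is created, so each child inherits it via copy-on-write rather than through pickling.

```python
import multiprocessing as mp

# Placeholder for the real large dictionary; built once in the parent.
BIG_DICT = {i: i * i for i in range(1000)}

def worker(key):
    # With the "fork" start method the child process inherits BIG_DICT
    # from the parent via copy-on-write; nothing is serialized here.
    return BIG_DICT[key]

if __name__ == "__main__":
    ctx = mp.get_context("fork")  # Unix only; not available on Windows
    with ctx.Pool(4) as pool:
        print(pool.map(worker, [2, 3, 5]))  # prints [4, 9, 25]
```

As the answer notes, this gives each child a snapshot: a daily update in the parent is not seen by already-forked workers, so the pool would need to be restarted (or each process told to reload) after the update.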

You could use Jython (or IronPython) to get a Python implementation with exactly the same multi-threading abilities as Java (or, respectively, C#), including multiple-processor usage by multiple simultaneous threads.

Alex Martelli
A: 

Take a look at the multiprocessing module in the stdlib: http://docs.python.org/library/multiprocessing.html It has a number of features that let you share data structures between processes fairly easily.
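
For example, `multiprocessing.Manager` provides a shared dict proxy (the keys and values below are illustrative). Be aware that every access is an IPC round trip to the manager process, so for a 250M read-heavy dict this shows the API more than it solves the performance problem:

```python
import multiprocessing as mp

def reader(shared, key, out):
    # Each access goes through a proxy to the manager process.
    out.put(shared[key])

if __name__ == "__main__":
    with mp.Manager() as manager:
        d = manager.dict({"a": 1, "b": 2})  # lives in the manager process
        out = manager.Queue()
        p = mp.Process(target=reader, args=(d, "a", out))
        p.start()
        p.join()
        print(out.get())  # prints 1
```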

fridder