ansaurus

Question

Converting graph traversal to multiprocessing in Python

Answer 1

A:

Queues in Python are synchronized. This means that only one thread at a time can read/write, this will definitely provoke a bottleneck in your app.

One better solution is to distribute the processing based on a hash function and assign the processing to the threads with a simple module operation. So for instance if you have 4 threads you could have 4 queues:

 thread_queues = []
 for i in range(4):
     thread_queues = Queue()

 for user in user_list:
    user_hash=hash(user.user_id) #hash in here is just shortcut to some standard hash utility 
    thread_id = user_hash % 4
    thread_queues[thread_id].put(user)

 # From here ... your pool of threads access thread_queues but each thread ONLY accesses 
 # one queue based on a numeric id given to each of them.

Most of hash functions will distribute evenly your data. I normally use UMAC. But maybe you can just try with the hash function from the Python String implementation.

Another improvement would be to avoid the use of Queues and use a non-sync object, such a list.

msalvadores 2010-08-16 08:14:48

I finally got around to trying this out and the synchronization was the bottleneck. The trouble was deciding what to partition the data on.

dcolish 2010-10-28 20:58:42

ansaurus

tags:

views:

answers:

Converting graph traversal to multiprocessing in Python

related questions