Hello,

I wrote a Python script that: 1. submits search queries, 2. waits for the results, 3. parses the returned results (XML).

I used the threading and Queue modules to perform this in parallel (5 workers).
It works great for the querying portion because I can submit multiple search jobs and deal with the results as they come in.
However, it appears that all my threads get bound to the same core. This becomes apparent when it gets to the part where it processes the XML (CPU intensive).

Has anyone else encountered this problem? Am I missing something conceptually?

Also, I was pondering the idea of having two separate work queues, one for making the queries and one for parsing the XML. As it is now, one worker does both in serial. I'm not sure what that would buy me, if anything. Any help is greatly appreciated.

Here is the code: (proprietary data removed)

import sys
import Queue
from threading import Thread

work_queue = Queue.Queue()   # shared job queue
thread_list = []             # holds consumer threads so we can join them

def addWork(source_list):
    for item in source_list:
        #print "adding: '%s'"%(item)
        work_queue.put(item)

def doWork(thread_id):
    while 1:
        try:
            gw = work_queue.get(block=False)
        except Queue.Empty:
            #print "thread '%d' is terminating..."%(thread_id)
            sys.exit() # no more work in the queue for this thread, die quietly

        ##Here is where I make the call to the REST API
        ##Here is where I wait for the results
        ##Here is where I parse the XML results and dump the data into a "global" dict

#MAIN
# 'sources' is the list of query inputs (proprietary data removed)
producer_thread = Thread(target=addWork, args=(sources,))
producer_thread.start() # start the thread (ie call the target/function)
producer_thread.join()  # wait for the thread/target function to terminate (block)

#start the consumers
for i in range(5):
    consumer_thread = Thread(target=doWork, args=(i,))
    consumer_thread.start()
    thread_list.append(consumer_thread)

for thread in thread_list:
    thread.join()
+2  A: 

This is a byproduct of how CPython handles threads. There are endless discussions around the internet (search for GIL), but the solution is to use the multiprocessing module instead of threading. Multiprocessing is built with pretty much the same interface (and synchronization structures, so you can still use queues) as threading. It just gives every worker its own process, thus avoiding the GIL and the forced serialization of parallel workloads.
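For illustration only (this sketch is not from the original answer): if the per-item work from doWork is wrapped in a single function that returns its parsed result instead of writing to a global dict, a process pool can replace the five threads. query_and_parse below is a hypothetical stand-in for the real REST call and XML parsing.

from multiprocessing import Pool

def query_and_parse(item):
    # Hypothetical wrapper around the existing per-item work: submit the
    # REST query for 'item', wait for the response, parse the XML, and
    # return the parsed data instead of writing to a global dict.
    parsed_data = {"item": item}   # placeholder for the real parsed result
    return item, parsed_data

if __name__ == '__main__':
    pool = Pool(processes=5)       # 5 worker processes instead of 5 threads
    results = {}
    # imap_unordered yields results as each worker finishes, which matches
    # the "deal with the results as they come in" behaviour in the question
    for item, parsed_data in pool.imap_unordered(query_and_parse, sources):
        results[item] = parsed_data
    pool.close()
    pool.join()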

Nathon
You have to move your information around via IPC, though. Even though the modules share the same interface, don't assume the same thing is happening underneath -- putting and getting items on a thread-safe queue doesn't involve copying data from one process to another, for example, while a multiprocessing queue does. I'll run out of characters, see my answer below.
Santiago Lezica
That worked! Thank you very much for the help.
nnachefski
A: 

Using CPython, your threads will never actually run in parallel on two different cores. Look up information on the Global Interpreter Lock (GIL).

Basically, there's a mutual exclusion lock protecting the actual execution part of the interpreter, so no two threads can execute Python bytecode in parallel. Threading works just fine for I/O-bound tasks, because the lock is released while a thread is blocked waiting on I/O.
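As a quick toy demonstration (not from the original answer), a pure-Python CPU-bound function run in two threads takes roughly as long as running it twice back to back, because only one thread can execute bytecode at a time:

import time
from threading import Thread

def burn():
    # pure-Python CPU work; the GIL stops two of these running at once
    total = 0
    for i in xrange(10 ** 7):
        total += i

start = time.time()
threads = [Thread(target=burn) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print "two threads took %.2fs" % (time.time() - start)  # roughly 2x a single call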

edit: If you want to take full advantage of multiple cores, you need to use multiple processes. There are a lot of articles about this topic; I'm trying to find one I remember being great, but I can't locate it =/.

As Nathon suggested, you can use the multiprocessing module. There are tools to help you share objects between processes (take a look at POSH, Python Object Sharing).
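Just as a sketch (parse_item below is a hypothetical stand-in for the question's query-plus-parse step), a manager-backed dict lets worker processes fill in a shared results mapping much like the question's "global" dict:

from multiprocessing import Process, Manager

def worker(items, shared_results):
    for item in items:
        # parse_item is a placeholder for the real REST query + XML parsing
        shared_results[item] = parse_item(item)

if __name__ == '__main__':
    manager = Manager()
    results = manager.dict()                     # proxy dict shared by all processes
    chunks = [sources[i::5] for i in range(5)]   # split the work across 5 workers
    procs = [Process(target=worker, args=(chunk, results)) for chunk in chunks]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    results = dict(results)                      # copy back into a plain dict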

Santiago Lezica