I have a code-base that I'm looking to split up and add to by using threading, however I'm relatively new on how to handle it. Please before reading further respect my wish of NOT just re-writing this code and tossing it back at me with the problem solved. I would much rather work the problem out by someone pointing me in the right direction, than someone solving it FOR me; I don't learn well that way.

The fully functioning code-base is here: http://pastebin.com/x0uSraEF -- It requires the mechanize and beautifulsoup libraries which can be installed via easy_install.

I've separated out all of my functions, and tried to keep the code as clean as possible (I'm sure there are some optimizations in there that I'll get reamed for), but the main problem is how to thread this.

My ultimate goal is to pack this into a thread, and then share cookies between other initialized browser objects in order to do other things while my original code is running 'backgrounded'.

I've tried thus:

import threading

class Recon(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)
        # Packed the stuff above my original while loop in here, minus functions.
    def run(self):
        # Packed my code past the while loop in here.
        pass

somevar = Recon()
somevar.start()

The problem I'm having is that once I run the program, it runs the things in init, but afterwards it just sits there and freezes on me. No traceback, no errors; it just doesn't do anything, and doesn't even return my command prompt to my control.

Could I just get some tips, or a general flow of how to convert this? I got overwhelmed and deleted the code I was trying with so I don't have that example, but do I need to be prepending 'self.' to all of my variables? Do I need to just define my vars as global?

Here is a reproduction of what I'm having trouble with after having tried to convert the script to use threading: http://pastebin.com/tU9GFsi6

+3  A: 

As long as you have a single thread (as in the above snippet, where you instantiate Recon just once), it shouldn't matter much what you do where; but of course I imagine the reason you're introducing threading is to eventually move to having multiple threads active.

If that's the case, then the first key issue is to ensure that you never have two or more threads simultaneously trying to use the same shared system/resource -- for example, multiple threads writing at the same time to ReconFile, in the case of the code at the pastebin URL you mention.

The classic way to avoid such issues is to use locking, but my favorite way is quite different: make sure any such resource is accessed by only one dedicated thread, and use a Queue.Queue instance (intrinsically threadsafe) to have other threads post work requests to the dedicated thread. So instead of writing to ReconFile directly, each other thread would make a list of lines to be written contiguously, then .put the list on the queue where the "recon file writing" worker thread is waiting via .get.
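To make the dedicated-writer idea concrete, here is a minimal sketch rather than a conversion of the pastebin code. It is written for Python 3, where the module is spelled `queue` (`Queue.Queue` is the Python 2 name). The names `write_queue`, `written`, `STOP`, `writer` and `scraper` are all made up for illustration, and a plain list stands in for ReconFile.

```python
import queue
import threading

# Hypothetical names throughout; `written` is a stand-in for ReconFile.
write_queue = queue.Queue()
written = []
STOP = None  # sentinel telling the writer thread to exit


def writer():
    """The only thread that ever touches the shared resource."""
    while True:
        lines = write_queue.get()  # blocks until a work request arrives
        if lines is STOP:
            break
        written.extend(lines)  # one request's lines stay contiguous


def scraper(name):
    """A worker that batches its lines and posts them with a single put."""
    write_queue.put(["%s: line 1" % name, "%s: line 2" % name])


writer_thread = threading.Thread(target=writer)
writer_thread.start()

workers = [threading.Thread(target=scraper, args=("worker-%d" % i,))
           for i in range(3)]
for w in workers:
    w.start()
for w in workers:
    w.join()

write_queue.put(STOP)  # all producers are done, so shut the writer down
writer_thread.join()
```

Because only `writer` ever appends to `written`, no lock is needed around the resource itself; the queue does all the synchronization.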

When you need to get results back from such actions (not the case here), the requesting thread would place its own personal "queue on which to return results" as part of the "work request packet" it puts to the worker thread's queue. I've presented much more detail about this recommended architecture in the threading chapter of "Python in a Nutshell" 2nd edition. (While, as the book's author, I would of course never recommend you perform an illegal download of a free pirate copy of my book, I can however mention there are plenty of sites offering such pirate copies for download -- the legal way to read my book for free is to sign up for a trial offer to O'Reilly's "safari" online books website.)
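The "queue on which to return results" pattern described above might look roughly like this in Python 3. This is only a sketch: `requests`, `my_replies` and `worker` are hypothetical names, and upper-casing a string stands in for whatever real work the dedicated thread would do.

```python
import queue
import threading

requests = queue.Queue()  # hypothetical name for the worker's inbox


def worker():
    while True:
        item = requests.get()
        if item is None:  # sentinel: stop the worker
            break
        payload, reply_queue = item  # the "work request packet"
        reply_queue.put(payload.upper())  # placeholder for the real work


worker_thread = threading.Thread(target=worker)
worker_thread.start()

# The requesting thread brings its own private queue for the answer.
my_replies = queue.Queue()
requests.put(("hello", my_replies))
result = my_replies.get(timeout=5)  # wait for the worker's reply

requests.put(None)
worker_thread.join()
```

Each requester owning its own reply queue means results can never be delivered to the wrong thread, even with many requesters sharing one worker.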

This does not address the specific problem you're observing, since it's happening when you only have one thread around. I notice that thread is trying to perform lots of I/O on standard input and standard output, which is possibly problematic from a thread -- consider doing the input for a thread before you start it (in the main thread) and for needed output use Python's standard logging module, which is guaranteed to be thread-safe. Do you still observe problems then? If that's the case, then the next step is to pepper your code with logging.info calls so that you can pinpoint exactly where it's stalling -- and tell us about that, so we can try to help from there!
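As a rough illustration of that suggestion (not a fix for the pastebin script), the sketch below collects the thread's input before `start()` and reports progress through `logging`. The `target_url` parameter, the `steps_done` counter and the three-step loop are assumptions made up for the example; in the real script the value would come from a prompt in the main thread.

```python
import logging
import threading

# Include the thread name so each log line shows which thread emitted it.
logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(threadName)s %(message)s")


class Recon(threading.Thread):
    def __init__(self, target_url):
        threading.Thread.__init__(self)
        self.target_url = target_url  # gathered before the thread starts
        self.steps_done = 0

    def run(self):
        logging.info("starting on %s", self.target_url)
        for step in range(3):  # placeholder for the real scraping steps
            self.steps_done += 1
            logging.info("finished step %d", step)
        logging.info("done")


# Hypothetical value; the real script would prompt for it here,
# in the main thread, instead of inside run().
target_url = "http://example.com/"
recon = Recon(target_url)
recon.start()
recon.join()
```

If the thread stalls, the last log line printed tells you which step it reached.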

Alex Martelli
All of your assumptions are correct; I am trying to introduce threading now, to prepare for some future code which will only share a list and a cookiejar object. The idea is that I will be scraping data with one thread, and acting on that data in another thread (but the data will never be accessed by both at the same time). If I call init and run, and my init is calling for raw input, would Python be running my run function while waiting for input from init? Or does init have to finish first?
ThantiK
@ThantiK, all I/O functions in Python "drop the GIL" (global interpreter lock) so other threads (with CPU tasks rather than I/O) can take over. But the `__init__` of a thread-subclass runs in the thread instantiating it (the main thread in the normal case, and in your case in particular) -- the new thread is born only upon the `start` method call and what it runs is the `run` method, only.
Alex Martelli
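A small check of the point about `__init__` versus `run`, as a sketch with made-up names (`Demo`, `init_thread`, `run_thread`): it records which thread executes each method.

```python
import threading


class Demo(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)
        # Executed by whichever thread instantiates Demo.
        self.init_thread = threading.current_thread().name
        self.run_thread = None

    def run(self):
        # Executed by the new thread, and only after start() is called.
        self.run_thread = threading.current_thread().name


creator = threading.current_thread().name
demo = Demo()
demo.start()
demo.join()

print("__init__ ran in:", demo.init_thread)
print("run ran in:", demo.run_thread)
```

So any blocking call inside `__init__` (such as waiting for input) holds up the instantiating thread, and `run` cannot begin until `__init__` has returned and `start` has been called.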
This is the result of my conversion, which SEEMS to be working fine now: http://pastebin.com/Pb582aF3 Any help making it cleaner or any criticism is welcome ;)
ThantiK
@ThantiK, I can't tell from that single thread what resources (stdin, stdout, `br`, etc) are going to be affected by multiple threads -- the only problem you're likely to have going forward (adding threads) is having more than one thread try to access a shared resource (which you can work around by locking, or better via a Queue, as I explained in my answer).
Alex Martelli