I'm processing 100k domain names into a CSV based on results taken from Siteadvisor using urllib (not the best method, I know). However, my current script creates too many threads and Python runs into errors. Is there a way I can "chunk" this script to do X number of domains at a time (say, 10-20) to prevent these errors? Thanks in advance.

import threading
import urllib

class Resolver(threading.Thread):
    def __init__(self, address, result_dict):
        threading.Thread.__init__(self)
        self.address = address
        self.result_dict = result_dict

    def run(self):
        try:
            content = urllib.urlopen("http://www.siteadvisor.com/sites/" + self.address).read(12000)
            search1 = content.find("didn't find any significant problems.")
            search2 = content.find('yellow')
            search3 = content.find('web reputation analysis found potential security')
            search4 = content.find("don't have the results yet.")

            if search1 != -1:
                result = "safe"
            elif search2 != -1:
                result = "caution"
            elif search3 != -1:
                result = "warning"
            elif search4 != -1:
                result = "unknown"
            else:
                result = ""

            self.result_dict[self.address] = result

        except:
            pass


def main():
    infile = open("domainslist", "r")
    intext = infile.readlines()
    threads = []
    results = {}
    for address in [address.strip() for address in intext if address.strip()]:
        resolver_thread = Resolver(address, results)
        threads.append(resolver_thread)
        resolver_thread.start()

    for thread in threads:
        thread.join()

    outfile = open('final.csv', 'w')
    outfile.write("\n".join("%s,%s" % (address, ip) for address, ip in results.iteritems()))
    outfile.close()

if __name__ == '__main__':
    main()

Edit: new version, based on andyortlieb's suggestions.

import threading
import urllib
import time

class Resolver(threading.Thread):
    def __init__(self, address, result_dict, threads):
        threading.Thread.__init__(self)
        self.address = address
        self.result_dict = result_dict
        self.threads = threads
    def run(self):
        try:
            content = urllib.urlopen("http://www.siteadvisor.com/sites/" + self.address).read(12000)
            search1 = content.find("didn't find any significant problems.")
            search2 = content.find('yellow')
            search3 = content.find('web reputation analysis found potential security')
            search4 = content.find("don't have the results yet.")

            if search1 != -1:
                result = "safe"
            elif search2 != -1:
                result = "caution"
            elif search3 != -1:
                result = "warning"
            elif search4 != -1:
                result = "unknown"
            else:
                result = ""

            self.result_dict[self.address] = result

            outfile = open('final.csv', 'a')
            outfile.write(self.address + "," + result + "\n")
            outfile.close()
            print self.address + result

            threads.remove(self)
        except:
            pass


def main():
    infile = open("domainslist", "r")
    intext = infile.readlines()
    threads = []
    results = {}

    for address in [address.strip() for address in intext if address.strip()]:
        loop=True
        while loop:
            if len(threads) < 20:
                resolver_thread = Resolver(address, results, threads)
                threads.append(resolver_thread)
                resolver_thread.start()
                loop=False
            else:
                time.sleep(.25)


    for thread in threads:
        thread.join()

#    removed so I can track the progress of the script
#    outfile = open('final.csv', 'w')
#    outfile.write("\n".join("%s,%s" % (address, ip) for address, ip in results.iteritems()))
#    outfile.close()

if __name__ == '__main__':
    main()
+2  A: 

Your existing code will work beautifully - just modify the `__init__` method of `Resolver` to take in a list of addresses instead of a single one, so that instead of one thread per address you have one thread for every 10 (for example). That way you won't overload the threading.

You'll obviously have to slightly modify `run` as well so it loops through the list of addresses instead of the single `self.address`.

I can work up a quick example if you'd like, but from the quality of your code I feel as though you'll be able to handle it quite easily.

Hope this helps!

EDIT: Example below, as requested. Note that you'll have to modify `main` to send your `Resolver` instances lists of addresses instead of a single address - I couldn't handle this for you without knowing more about the format of your file and how the addresses are stored. Note: you could do the `run` method with a helper function, but I thought this might be more understandable as an example.
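The `main`-side change the answer leaves to the reader amounts to splitting the address list into fixed-size groups. A minimal sketch (the `chunk` helper name is mine, not from the answer, and the sample addresses are placeholders):

```python
def chunk(seq, size):
    """Yield successive slices of seq, each at most `size` items long."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

addresses = ["a.com", "b.com", "c.com", "d.com", "e.com"]
groups = list(chunk(addresses, 2))
print(groups)  # [['a.com', 'b.com'], ['c.com', 'd.com'], ['e.com']]

# main() would then create one Resolver per group rather than per address:
# for group in groups:
#     resolver_thread = Resolver(group, results)
```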

class Resolver(threading.Thread):
    def __init__(self, addresses, result_dict):
        threading.Thread.__init__(self)
        self.addresses = addresses  # Now takes in a list of multiple addresses
        self.result_dict = result_dict

    def run(self):
        for address in self.addresses: # do your existing code for every address in the list
            try:
                content = urllib.urlopen("http://www.siteadvisor.com/sites/" + address).read(12000)
                search1 = content.find("didn't find any significant problems.")
                search2 = content.find('yellow')
                search3 = content.find('web reputation analysis found potential security')
                search4 = content.find("don't have the results yet.")

                if search1 != -1:
                    result = "safe"
                elif search2 != -1:
                    result = "caution"
                elif search3 != -1:
                    result = "warning"
                elif search4 != -1:
                    result = "unknown"
                else:
                    result = ""

                self.result_dict[address] = result
            except:
                pass
nearlymonolith
Can you post an example, please?
Tom
I just edited my answer with a quick example - you'll have to edit your `run` method to pass in lists of addresses instead of a single address, but I left that to you as I don't know how your input file, etc. is formatted and I don't want to pass along broken code. As you can see it's a minor change.
nearlymonolith
So in this case, would you be dividing your address list by the number of threads you're allowing to be created, and passing those sections of the address list to `Resolver`? If so, then you may want to move all the `strip()` handling out of `main` and into `Resolver`.
andyortlieb
Exactly! I just wanted to avoid confusing the issue of getting fewer threads running with the issue of how exactly to put the lists of addresses into the Resolver. It would probably be best, as you suggested, to pass blocks of `n` lines at a time, and then pre-process those lines in the Resolver `__init__` method before `run` deals with them.
nearlymonolith
+2  A: 

This might be kind of rigid, but you could pass `threads` to `Resolver`, so that when `Resolver.run` is completed, it can call `threads.remove(self)`.

Then you can nest some conditions so that threads are only created if there is room for them, and if there isn't room, they wait until there is.

for address in [address.strip() for address in intext if address.strip()]:
    loop = True
    while loop:
        if len(threads) < 20:
            resolver_thread = Resolver(address, results, threads)
            threads.append(resolver_thread)
            resolver_thread.start()
            loop = False
        else:
            time.sleep(.25)
andyortlieb
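The sleep-and-poll loop above can also be expressed with a bounded semaphore, which blocks instead of polling. This is a sketch of that alternative in modern Python 3 syntax, not part of the original answer; the `worker` function is a stand-in for the real Siteadvisor lookup:

```python
import threading

# A bounded semaphore caps concurrency at 20: acquire() blocks until a
# running thread release()s its slot, so no busy-waiting is needed.
slots = threading.BoundedSemaphore(20)

def worker(address, results):
    try:
        results[address] = "safe"  # stand-in for the urllib lookup
    finally:
        slots.release()  # free the slot even if the lookup raises

results = {}
threads = []
for address in ["a.com", "b.com", "c.com"]:
    slots.acquire()  # blocks while 20 workers are already running
    t = threading.Thread(target=worker, args=(address, results))
    threads.append(t)
    t.start()

for t in threads:
    t.join()
```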
Thanks for your help so far. I've implemented your changes. However, the script only gets as far as 20 domains. I've put my script above. Do you know what the problem is?
Tom
I believe all you need is to change `threads.remove(self)` to `self.threads.remove(self)`.
andyortlieb
Facepalm. Didn't see that. Thanks for your help.
Tom
You're going to want to get rid of that `try:` block with no exception handling in order to troubleshoot why the script fails to terminate 50% of the time (I'm willing to bet it's because exceptions aren't being handled properly). It could be an exception from urllib, or perhaps from racing to that CSV file. So I have a couple more suggestions: actually do something in that `except` block - AT LEAST remove self from `self.threads` again. Another suggestion: rather than writing to the CSV from the thread, write to a common dictionary, and write the CSV from `main()` after all the threads have ended.
andyortlieb
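The collect-then-write approach suggested in the comment above might look like this (Python 3 syntax; the `record` helper, the lock, and the sample verdicts are my additions, not from the comment):

```python
import threading

# Each thread records its verdict in a shared dict; main() writes the CSV
# once after join(), so no two threads ever touch the file.
results = {}
lock = threading.Lock()

def record(address, verdict):
    with lock:  # explicit locking, though CPython dict writes are atomic
        results[address] = verdict

threads = [threading.Thread(target=record, args=pair)
           for pair in [("a.com", "safe"), ("b.com", "caution")]]
for t in threads:
    t.start()
for t in threads:
    t.join()

csv_text = "\n".join("%s,%s" % kv for kv in sorted(results.items()))
print(csv_text)
# a.com,safe
# b.com,caution
```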
Sorry, just one last remark: circumventing the `try:` block altogether, I had no problems with the code hanging. I don't know why, but I thought I would share, FYI. Thank you for the acceptance.
andyortlieb