I'm writing a Python script to read through a list of domains, look up the rating McAfee's SiteAdvisor service gives each one, then output the domain and result to a CSV.

I've based my script off this previous answer. It uses urllib to scrape SiteAdvisor's page for the domain in question (not the best method, I know, but SiteAdvisor provides no alternative). Unfortunately, it fails to produce anything - I consistently get this error:

Traceback (most recent call last):
  File "multi.py", line 55, in <module>
    main()
  File "multi.py", line 44, in main
    resolver_thread.start()
  File "/usr/lib/python2.6/threading.py", line 474, in start
    _start_new_thread(self.__bootstrap, ())
thread.error: can't start new thread

Here is my script:

import threading
import urllib

class Resolver(threading.Thread):
    def __init__(self, address, result_dict):
        threading.Thread.__init__(self)
        self.address = address
        self.result_dict = result_dict

    def run(self):
        try:
            content = urllib.urlopen("http://www.siteadvisor.com/sites/" + self.address).read(12000)
            search1 = content.find("didn't find any significant problems.")
            search2 = content.find('yellow')
            search3 = content.find('web reputation analysis found potential security')
            search4 = content.find("don't have the results yet.")

            if search1 != -1:
                result = "safe"
            elif search2 != -1:
                result = "caution"
            elif search3 != -1:
                result = "warning"
            elif search4 != -1:
                result = "unknown"
            else:
                result = ""

            self.result_dict[self.address] = result

        except:
            pass


def main():
    infile = open("domainslist", "r")
    intext = infile.readlines()
    threads = []
    results = {}
    for address in [address.strip() for address in intext if address.strip()]:
        resolver_thread = Resolver(address, results)
        threads.append(resolver_thread)
        resolver_thread.start()

    for thread in threads:
        thread.join()

    outfile = open('final.csv', 'w')
    outfile.write("\n".join("%s,%s" % (address, ip) for address, ip in results.iteritems()))
    outfile.close()

if __name__ == '__main__':
    main()

Any help would be greatly appreciated.

+1  A: 

It looks like you are trying to start too many threads.

You can check how many items are in the [address.strip() for address in intext if address.strip()] list. I guess this is the problem here: there is a limit on the resources available to the process, which caps how many new threads you can start.

The solution is to split your list into chunks of, say, 20 elements, process each chunk (in 20 threads), wait for those threads to finish their jobs, and then pick up the next chunk. Repeat until every element in your list has been processed.
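For example, here is a minimal sketch of that batching approach, reusing the Resolver class from the question; the chunk size of 20 and this rewritten main() are just illustrative:

def main():
    with open("domainslist", "r") as infile:
        addresses = [line.strip() for line in infile if line.strip()]

    results = {}
    chunk_size = 20
    # Process the domains in batches so at most chunk_size threads exist at once.
    for i in range(0, len(addresses), chunk_size):
        batch = [Resolver(address, results) for address in addresses[i:i + chunk_size]]
        for thread in batch:
            thread.start()
        # Wait for the whole batch to finish before starting the next one.
        for thread in batch:
            thread.join()

    outfile = open('final.csv', 'w')
    outfile.write("\n".join("%s,%s" % (address, rating) for address, rating in results.iteritems()))
    outfile.close()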

You can also use a thread pool for better thread management. (I recently used this implementation.)

Lukasz Dziedzia
Sounds like a good idea. Thanks
Tom
Glad I could help you
Lukasz Dziedzia
+1  A: 

There's probably an upper limit to the number of threads you can create, and you're probably exceeding it.

Suggestion: Create a small, fixed number of Resolvers - fewer than 10 will probably get you 90% of the possible parallelism benefit - and a (threadsafe) Queue from Python's Queue module. Have the main thread dump all the domains into the queue, and have each Resolver take one domain at a time from the queue and work on it.
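A rough sketch of that worker-pool approach, assuming Python 2 (the Queue module) and using a simplified rating check in place of the full set of string searches; worker and NUM_WORKERS are illustrative names, not part of the original script:

import threading
import urllib
import Queue  # named "queue" in Python 3

NUM_WORKERS = 8  # small, fixed number of worker threads

def worker(domain_queue, results):
    # Each worker repeatedly takes one domain from the queue until it is empty.
    while True:
        try:
            address = domain_queue.get_nowait()
        except Queue.Empty:
            return
        content = urllib.urlopen("http://www.siteadvisor.com/sites/" + address).read(12000)
        # Simplified: only one of the string searches from Resolver.run().
        results[address] = "safe" if "didn't find any significant problems." in content else ""

def main():
    domain_queue = Queue.Queue()
    results = {}
    for line in open("domainslist"):
        if line.strip():
            domain_queue.put(line.strip())

    workers = [threading.Thread(target=worker, args=(domain_queue, results))
               for _ in range(NUM_WORKERS)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

Because every domain is put on the queue before the workers start, get_nowait() only raises Queue.Empty once the work is genuinely done, so each worker can simply exit at that point.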

Russell Borogove