ansaurus

Question

Increasing throughput in a python script

Answer 1

+2 A:

The vast majority of the time here is spent in the external calls to dig, so to improve that speed, you'll need to multithread. This will allow you to run multiple calls to dig at the same time. See for example: http://stackoverflow.com/questions/984941/python-subprocess-popen-from-a-thread . Or, you can use Twisted ( http://twistedmatrix.com/trac/ ).
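A minimal sketch of the threaded approach in Python 3, assuming dig is on the PATH. The helper names (dig_command, run_lookup, lookup_all), the "+short" flag, the 10-second timeout and the pool size of 20 are illustrative choices, not taken from the original script.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def dig_command(domain):
    # "+short" asks dig to print only the answer records.
    return ["dig", "+short", domain]


def run_lookup(domain, build_command=dig_command, timeout=10):
    """Run one external lookup and return (domain, stripped stdout)."""
    try:
        proc = subprocess.run(
            build_command(domain),
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # Treat a hung lookup as "no answer" instead of stalling the pool.
        return domain, ""
    return domain, proc.stdout.strip()


def lookup_all(domains, build_command=dig_command, max_workers=20):
    """Run the lookups concurrently and return {domain: output}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pairs = pool.map(lambda d: run_lookup(d, build_command), domains)
        return dict(pairs)
```

Each worker thread spends nearly all of its time waiting on the external process, so the lookups overlap even though they run in threads of one Python process. The build_command parameter is only there so a different command can be swapped in.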

EDIT: You're correct, much of that was unnecessary.

gilesc 2010-06-22 00:57:25

+1 for mentioning the need to thread calls to dig. You should put this foremost in your answer.

Matt Joiner 2010-06-22 01:23:34

Answer 2

A:

I'd consider using a pure-Python library to do the DNS queries rather than delegating to dig, because invoking another process can be relatively time-consuming. (Of course, looking up anything on the internet is also relatively time-consuming, so what gilesc said about multithreading still applies.) A Google search for "python dns" will give you some options to get started with.

David Zaslavsky 2010-06-22 01:41:22

the overhead of starting new processes is negligible compared to the time taken to execute a DNS query

Matt Joiner 2010-06-22 04:14:34

Yes, but the network delay was already covered by gilesc's answer.

David Zaslavsky 2010-06-22 05:25:21

Answer 3

A:

In order to keep pace with the server updates, the script must take less than 15 minutes to execute. Does your script take 15 minutes to run? If it takes less than that, you're done!

I would investigate caching and diffs from previous runs in order to increase performance.
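One way the caching idea could look, as a rough Python 3 sketch: keep the results of the previous run in a JSON file and only resolve domains that are not in it yet. The file format and the function names (load_cache, domains_to_resolve, update_cache) are assumptions made for illustration. Note that DNS records change, so a real script would probably also want to expire old entries.

```python
import json
import os


def load_cache(path):
    """Return the {domain: ip} mapping saved by a previous run, if any."""
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)


def domains_to_resolve(domains, cache):
    """Keep only the domains that were not resolved in an earlier run."""
    return [d for d in domains if d not in cache]


def update_cache(path, cache, new_results):
    """Merge freshly resolved entries into the cache and write it back."""
    cache.update(new_results)
    with open(path, "w") as f:
        json.dump(cache, f)
    return cache
```

On each run, only the output of domains_to_resolve needs to go through the slow lookups; everything else is read straight from the cache file.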

Arafangion 2010-06-22 03:34:21

he's already mentioned it takes several hours

Matt Joiner 2010-06-22 04:13:53

Answer 4

+2 A:

Well, it's probably the name resolution that's taking you so long. If you count that out (i.e., if somehow dig returned very quickly), Python should be able to deal with thousands of entries easily.

That said, you should try a threaded approach. That would (theoretically) resolve several addresses at the same time, instead of sequentially. You could keep using dig for that, and it should be trivial to modify my example code below accordingly, but to make things interesting (and hopefully more pythonic), let's use an existing module instead: dnspython.

So, install it with:

sudo pip install -f http://www.dnspython.org/kits/1.8.0/ dnspython

And then try something like the following:

import threading
from dns import resolver

class Resolver(threading.Thread):
    def __init__(self, address, result_dict):
        threading.Thread.__init__(self)
        self.address = address
        self.result_dict = result_dict

    def run(self):
        try:
            result = resolver.query(self.address)[0].to_text()
            self.result_dict[self.address] = result
        except resolver.NXDOMAIN:
            pass


def main():
    # Read the domain list, skipping blank lines.
    with open("domainlist", "r") as infile:
        addresses = [line.strip() for line in infile if line.strip()]

    threads = []
    results = {}
    # Start one resolver thread per address.
    for address in addresses:
        resolver_thread = Resolver(address, results)
        threads.append(resolver_thread)
        resolver_thread.start()

    # Wait for every lookup to finish before writing the output.
    for thread in threads:
        thread.join()

    with open('final.csv', 'w') as outfile:
        outfile.write("\n".join("%s,%s" % (address, ip) for address, ip in results.items()))

if __name__ == '__main__':
    main()

If that turns out to start too many threads at once, you could try doing it in batches, or using a queue (see http://www.ibm.com/developerworks/aix/library/au-threadingpython/ for an example).
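A rough sketch of the queue variant in Python 3: a fixed number of worker threads pull addresses from a shared queue, so the thread count stays bounded no matter how long the domain list is. The function name resolve_with_workers, its resolve parameter and the default of 20 workers are assumptions for illustration; resolve is any callable that takes an address and returns its IP.

```python
import queue
import threading


def resolve_with_workers(addresses, resolve, num_workers=20):
    """Resolve addresses with a bounded pool of worker threads."""
    tasks = queue.Queue()
    for address in addresses:
        tasks.put(address)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                # The queue is filled before any worker starts, so an
                # empty queue means there is no work left.
                address = tasks.get_nowait()
            except queue.Empty:
                return
            try:
                ip = resolve(address)
            except Exception:
                # Sketch only: a real script should catch the specific
                # lookup errors it expects instead of everything.
                continue
            with lock:
                results[address] = ip

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

With dnspython, the lookup from the example above could be passed in as resolve, for instance lambda address: resolver.query(address)[0].to_text().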

rbp 2010-06-22 11:43:55

Works great, but I keep on getting these errors:
File "/home/okim/dnspython/dns/resolver.py", line 541, in _compute_timeout
raise Timeout
Timeout
Any ideas?

Tom 2010-06-22 21:10:38

Apparently your nameserver is unable to resolve some of the names within the default timeout (possibly the authoritative nameserver is not responding). If you want to skip those, simply change the "except resolver.NXDOMAIN:" line to "except (resolver.NXDOMAIN, resolver.Timeout):". If you want to treat those exceptions differently, just add a new except clause, after the NXDOMAIN one. Incidentally, NXDOMAIN captures non-existent domains.

rbp 2010-06-22 21:45:02

BTW, if you find out that the problem is your nameserver, you can specify one explicitly with "resolver.default_resolver.nameservers.insert(0, '8.8.8.8')". I only mention it for completeness, though: it's more likely that, among thousands of domain names, some simply aren't responding (especially since you mention they come from a blacklist).

rbp 2010-06-22 21:48:09

Yeah, my nameserver has problems. Changing the "except" works though. Thanks for your help!

Tom 2010-06-23 01:00:51

No problem, good luck :)

rbp 2010-06-23 01:10:30
