I'm new to Python and multithreading, so please bear with me.

I'm writing a script that takes a list of domains, looks each one up through Web of Trust (a service that rates websites' "trustworthiness" on a scale of 1-100), and writes the results to a CSV. Unfortunately Web of Trust's servers can take quite a while to respond, so processing 100k domains one at a time takes hours.

My attempts at multithreading so far have been disappointing -- modifying the script from this answer gave threading errors, I believe because some requests took too long to complete.

Here's my unmodified script. Can someone help me multithread it, or point me to a good multithreading resource? Thanks in advance.

import urllib
import re

# Read the domain list (one domain per line) and split it into entries.
text = open("top100k", "r").read()
domains = re.split("\n+", text)

out = open('output.csv', 'w')

for element in domains:
    try:
        # Query the WoT public API for this domain.
        content = urllib.urlopen("http://api.mywot.com/0.4/public_query2?target=" + element).read()
        # Pull the trustworthiness score out of the XML response.
        content = content[content.index('<application name="0" r="'):content.index('" c')]
        out.write(element + "," + content[25:27] + "\n")
    except Exception:
        pass

out.close()
+1  A: 

A quick scan through the WoT API documentation shows that, in addition to the public_query2 request you are using, there is a public_query_json request that returns data for up to 100 targets per call. I would suggest using that before you start flooding their server with lots of requests in parallel.
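For illustration, here is a rough sketch of how your domain list could be chunked into batches of 100 and sent to public_query_json one batch at a time. I'm guessing that multiple targets are joined with "/" in the target parameter and guessing at the JSON field names, so check the documentation and inspect a real response before relying on any of this:

import urllib
import json

def batches(seq, size=100):
    # Yield successive chunks of at most `size` items.
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

domains = [d for d in open("top100k").read().split() if d]
out = open("output.csv", "w")

for batch in batches(domains):
    # Assumption: targets are joined with "/" -- verify the batch syntax
    # against the public_query_json documentation.
    url = ("http://api.mywot.com/0.4/public_query_json?target=" +
           "/".join(batch))
    try:
        data = json.loads(urllib.urlopen(url).read())
        for entry in data:
            # Assumption: each entry has a "target" field and a "0"
            # component holding the trustworthiness data (possibly a
            # [reputation, confidence] pair); adjust after inspecting
            # a real response.
            out.write("%s,%s\n" % (entry.get("target"), entry.get("0")))
    except Exception:
        pass

out.close()

One request per 100 domains cuts the number of round trips from 100k to about 1k, which should help far more than threading the per-domain requests.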

Dave Kirby
Thanks for the answer.
Tom