What would be the best library for multithreaded harvesting/downloading with multiple proxy support? I've looked at Tkinter and it looks good, but there are so many options. Does anyone have a specific recommendation? Many thanks!
A:
Is this something you can't just do by passing a URL to newly spawned threads and calling urllib2.urlopen in each one, or is there a more specific requirement?
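For reference, a rough sketch of that approach, not a definitive implementation. It assumes Python 3, where urllib2's urlopen lives in urllib.request. The names fetch, download_all and fetch_one are made up for illustration.

```python
import threading
import urllib.request  # Python 3 home of what used to be urllib2


def fetch(url, timeout=10):
    """Download one URL and return its body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_all(urls, fetch_one=fetch):
    """Fetch every URL in its own thread; return {url: bytes or Exception}."""
    results = {}
    lock = threading.Lock()  # guards the one dict the threads share

    def worker(url):
        try:
            value = fetch_one(url)
        except Exception as exc:  # one bad URL should not stop the others
            value = exc
        with lock:
            results[url] = value

    threads = [threading.Thread(target=worker, args=(u,)) for u in urls]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until every download has finished
    return results
```

Calling download_all(list_of_urls) starts one thread per URL, so for a very large list you would probably want to cap the number of threads, for example with a fixed pool of workers reading from a queue.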
Kylotan
2009-10-20 20:36:02
urllib2 isn't thread-safe from what I've seen, but I could have just been doing it wrong because I'm a noob to threading. I am downloading a lot of files, so I'd rather use something a bit more powerful than just urllib anyway.
Cookies
2009-10-20 20:40:55
It's almost certain to be thread-safe unless you do something inherently dangerous like trying to access the same object from multiple threads.
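One way to avoid sharing objects, and to get the multiple-proxy part at the same time, is to give each thread its own opener. This is only a sketch under assumptions: Python 3's urllib.request (the successor to urllib2), and the names make_opener and download_via_proxies plus any proxy addresses you pass in are hypothetical.

```python
import threading
import urllib.request


def make_opener(proxy_url):
    """Build a private opener that routes HTTP and HTTPS through one proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def download_via_proxies(jobs, timeout=10):
    """jobs is a list of (url, proxy_url) pairs; returns {url: bytes or Exception}."""
    results = {}
    lock = threading.Lock()  # the results dict is the only shared object

    def worker(url, proxy_url):
        opener = make_opener(proxy_url)  # created inside the thread, never shared
        try:
            with opener.open(url, timeout=timeout) as response:
                body = response.read()
        except Exception as exc:  # record the failure instead of crashing
            body = exc
        with lock:
            results[url] = body

    threads = [threading.Thread(target=worker, args=job) for job in jobs]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Each worker only touches its own opener and response, so the only cross-thread access is the locked write into results.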
Kylotan
2009-10-20 22:10:59
A:
Also take a look at http://scrapy.org/, which is a scraping framework built on top of Twisted.
twneale
2009-10-20 21:24:04
Excellent, I don't see anything about proxy support but I think I could do that myself.
Cookies
2009-10-20 21:36:35
No. Support for HTTP proxies is not currently implemented in Scrapy, but it will be in the future. For more information about this, follow this ticket. Setting the http_proxy environment variable won’t work because Twisted (the library used by Scrapy to download pages) doesn’t support it. See this Twisted ticket for more info.
Cookies
2009-10-20 21:39:02