tags:
views: 44
answers: 2

I have a scraper which queries different websites. Some of them use Content-Encoding, and not consistently. And since I'm trying to simulate an AJAX query and need to mimic Mozilla, I need full support. There are multiple HTTP libraries for Python, but none of them seems complete:

httplib seems pretty low-level, more like an HTTP packet sniffer really.

urllib2 is some sort of elaborate hoax. There are a dozen handlers for various web client functions, but mandatory HTTP features like Content-Encoding apparently aren't among them.

mechanize is nice, already somewhat overkill for my tasks, but it only supports Content-Encoding 'gzip'.

httplib2 sounded most promising, but it actually fails on 'deflate' encoding because of the disparity between raw deflate and zlib streams (see the sketch after this list).
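
For reference, the disparity is that some servers send Content-Encoding: deflate as a zlib-wrapped stream and others as a raw DEFLATE stream without the zlib header. Here is a minimal sketch of the workaround I mean; the helper name decode_deflate is just for illustration and isn't part of any of these libraries:

    import zlib

    def decode_deflate(body):
        # Most servers send a zlib-wrapped stream for Content-Encoding: deflate,
        # but some send a raw DEFLATE stream without the zlib header.
        try:
            return zlib.decompress(body)                   # zlib-wrapped stream
        except zlib.error:
            return zlib.decompress(body, -zlib.MAX_WBITS)  # raw DEFLATE stream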

So are there any other options? I can't believe I'm expected to reimplement workarounds for the above libraries. And it's not a good idea to distribute patched versions alongside my application, because packagers might strip them out again if the corresponding library is available as a separate distribution package.

I almost don't dare say it, but PHP's http functions API is much nicer. And besides Content-Encoding: *, I might at some point need multipart/form-data too. So, is there a comprehensive third-party library for HTTP retrieval?

A: 

Beautiful Soup might work. Just throwing it out there.

karlw
BeautifulSoup is for parsing HTML and similar markup. It doesn't deal with HTTP.
Peter Lyons
+1  A: 

I would consider either invoking cURL as a child process or using the Python bindings for libcurl (PycURL).

From this description cURL seems to support gzip and deflate.
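
A minimal PycURL sketch, assuming a pycurl build that exposes CURLOPT_ENCODING as pycurl.ENCODING; with that option set, libcurl sends the Accept-Encoding header and decodes gzip and deflate responses itself (the URL and header values are placeholders):

    import pycurl
    from io import BytesIO

    buf = BytesIO()
    c = pycurl.Curl()
    c.setopt(pycurl.URL, "http://example.com/ajax-endpoint")  # placeholder URL
    c.setopt(pycurl.ENCODING, "gzip, deflate")                 # libcurl decodes transparently
    c.setopt(pycurl.USERAGENT, "Mozilla/5.0 (compatible)")     # mimic Mozilla
    c.setopt(pycurl.HTTPHEADER, ["X-Requested-With: XMLHttpRequest"])  # AJAX-style header
    c.setopt(pycurl.WRITEFUNCTION, buf.write)
    c.perform()
    c.close()
    body = buf.getvalue()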

Peter Lyons
I prefer wget over curl for command-line work, and was thus a little reluctant, because PycURL is also a non-standard extension. But it's probably the most mature and feature-complete solution around, so really the best choice.
mario