I need to detect character encoding in HTTP responses. To do this I look at the headers; if the encoding isn't set in the Content-Type header, I have to peek at the response body and look for a "<meta http-equiv='content-type'>" tag in the document. I'd like to be able to write a function that looks and works something like this:

response = urllib2.urlopen("http://www.example.com/")
encoding = detect_html_encoding(response)
...
page_text = response.read()

However, if I call response.read() inside my "detect_html_encoding" function, then the subsequent response.read() after the call to my function will fail.

Is there an easy way to peek at the response and/or rewind after a read?
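For context, the meta-tag fallback described above might be sketched roughly like this (the regex and the sniff_meta_charset name are illustrative, not part of the question):

```python
import re

# Hypothetical helper: look for <meta http-equiv="content-type" ...>
# and pull the charset out of its content attribute.
META_RE = re.compile(
    rb'<meta[^>]+http-equiv=["\']?content-type["\']?[^>]*'
    rb'content=["\']?[^"\'>]*charset=([\w-]+)',
    re.IGNORECASE,
)

def sniff_meta_charset(head_bytes):
    # Returns the declared charset, or None if no meta tag is found.
    match = META_RE.search(head_bytes)
    return match.group(1).decode("ascii") if match else None

html = b'<meta http-equiv="content-type" content="text/html; charset=utf-8">'
print(sniff_meta_charset(html))  # utf-8
```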

A: 
  1. If it's in the HTTP headers (not the document itself) you could use response.info() to detect the encoding
  2. If you want to parse the HTML, save the response data:

    page_text = response.read()
    encoding = detect_html_encoding(response, page_text)
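Option 1 can be sketched with a small helper that parses the charset parameter out of a Content-Type value; the function name is mine, not part of the answer (with urllib2, response.info().getparam('charset') should yield the same value):

```python
def charset_from_content_type(content_type):
    # Walk the parameters after the media type, e.g.
    # "text/html; charset=ISO-8859-1" -> "ISO-8859-1".
    for part in content_type.split(";")[1:]:
        key, _, value = part.strip().partition("=")
        if key.lower() == "charset":
            return value.strip("\"' ") or None
    return None

print(charset_from_content_type("text/html; charset=ISO-8859-1"))  # ISO-8859-1
```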
    
orip
It can be (1) in the headers, (2) in the document or (3) absent (in which case I have to use chardet to detect it based on the characters in the document). I can obviously pull the text out ahead of time, but what I'd specifically like is to avoid that type of approach.
John
+3  A: 
def detectit(response):
    # try headers &c, then, worst case...:
    content = response.read()
    response.read = lambda: content
    # now detect based on content

The trick of course is ensuring that response.read() WILL return the same thing again if needed... that's why we assign that lambda to it if necessary, i.e., if we already needed to extract the content -- that ensures the same content can be extracted again (and again, and again, ...;-).
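The trick can be exercised without a live HTTP connection by standing in a BytesIO object for the response (the FakeResponse class is purely for demonstration and not part of the answer):

```python
import io

def detectit(response):
    # Worst case: the encoding isn't in the headers, so we must
    # consume the body to sniff it.
    content = response.read()
    # Replace read() so later callers get the same bytes again.
    response.read = lambda: content
    # ... run <meta> / chardet detection on `content` here ...
    return content

class FakeResponse(io.BytesIO):
    # Subclassing gives instances a __dict__, so read() is assignable.
    pass

resp = FakeResponse(b"<html>hello</html>")
detectit(resp)
print(resp.read())  # the body is still readable after detection
```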

Alex Martelli