I need to detect character encoding in HTTP responses. To do this I look at the headers; if the encoding isn't set in the Content-Type header, I have to peek at the response body and look for a "<meta http-equiv='content-type'>" tag in the document. I'd like to be able to write a function that looks and works something like this:

response = urllib2.urlopen("http://www.example.com/")
encoding = detect_html_encoding(response)
...
page_text = response.read()

However, if I call response.read() inside my "detect_html_encoding" function, then the subsequent response.read() after the call to my function will fail.

Is there an easy way to peek at the response and/or rewind after a read?
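For context, the meta-tag fallback described above might be sketched roughly like this (the regex and the sniff_meta_charset name are illustrative, not part of the question):

```python
import re

# Hypothetical helper: look for <meta http-equiv="content-type" ...>
# and pull the charset out of its content attribute.
META_RE = re.compile(
    rb'<meta[^>]+http-equiv=["\']?content-type["\']?[^>]*'
    rb'content=["\']?[^"\'>]*charset=([\w-]+)',
    re.IGNORECASE,
)

def sniff_meta_charset(head_bytes):
    # Returns the declared charset, or None if no meta tag is found.
    match = META_RE.search(head_bytes)
    return match.group(1).decode("ascii") if match else None

html = b'<meta http-equiv="content-type" content="text/html; charset=utf-8">'
print(sniff_meta_charset(html))  # utf-8
```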

A: 
  1. If it's in the HTTP headers (not the document itself) you could use response.info() to detect the encoding
  2. If you want to parse the HTML, save the response data:

    page_text = response.read()
    encoding = detect_html_encoding(response, page_text)
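Option 1 can be sketched with a small helper that parses the charset parameter out of a Content-Type value; the function name is mine, not part of the answer (with urllib2, response.info().getparam('charset') should yield the same value):

```python
def charset_from_content_type(content_type):
    # Walk the parameters after the media type, e.g.
    # "text/html; charset=ISO-8859-1" -> "ISO-8859-1".
    for part in content_type.split(";")[1:]:
        key, _, value = part.strip().partition("=")
        if key.lower() == "charset":
            return value.strip("\"' ") or None
    return None

print(charset_from_content_type("text/html; charset=ISO-8859-1"))  # ISO-8859-1
```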
    
orip
It can be (1) in the headers, (2) in the document or (3) absent (in which case I have to use chardet to detect it based on the characters in the document). I can obviously pull the text out ahead of time, but what I'd specifically like is to avoid that type of approach.
John
+3  A: 
def detectit(response):
    # try headers &c, then, worst case...:
    content = response.read()
    response.read = lambda: content
    # now detect based on content

The trick of course is ensuring that response.read() WILL return the same thing again if needed... that's why we assign that lambda to it if necessary, i.e., if we already needed to extract the content -- that ensures the same content can be extracted again (and again, and again, ...;-).
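The trick can be exercised without a live HTTP connection by standing in a BytesIO object for the response (the FakeResponse class is purely for demonstration and not part of the answer):

```python
import io

def detectit(response):
    # Worst case: the encoding isn't in the headers, so we must
    # consume the body to sniff it.
    content = response.read()
    # Replace read() so later callers get the same bytes again.
    response.read = lambda: content
    # ... run <meta> / chardet detection on `content` here ...
    return content

class FakeResponse(io.BytesIO):
    # Subclassing gives instances a __dict__, so read() is assignable.
    pass

resp = FakeResponse(b"<html>hello</html>")
detectit(resp)
print(resp.read())  # the body is still readable after detection
```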

Alex Martelli