ansaurus

Question

Encoding problem downloading HTML using mechanize and Python 2.6

Answer 1

+1 A:

u = html.decode('utf-8')

Ned Batchelder 2010-09-27 14:23:43

UnicodeDecodeError: 'utf8' codec can't decode byte 0x8b in position 1: unexpected code byte

how 2010-09-27 14:26:15

Then it isn't utf-8. You should examine the headers to see what character set is being returned.

Ned Batchelder 2010-09-27 14:50:02
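As a rough sketch of the header check suggested above: the helper below pulls the charset parameter out of a Content-Type value. The function name `charset_from_content_type` is made up for illustration and is not part of mechanize; with mechanize the header value would typically come from `response.info()`.

```python
def charset_from_content_type(content_type, default=None):
    """Return the charset parameter of a Content-Type header value, if any."""
    # Skip the media type itself ("text/html") and look at the parameters.
    for part in content_type.split(';')[1:]:
        name, _, value = part.strip().partition('=')
        if name.strip().lower() == 'charset':
            # Parameter values may be quoted and are case-insensitive.
            return value.strip().strip('"\'').lower() or default
    return default


if __name__ == '__main__':
    print(charset_from_content_type('text/html; charset=utf-8'))
```

Note that the charset only describes the body after any Content-Encoding (such as gzip) has been undone, so a correct charset header does not by itself mean the raw bytes can be decoded.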

facebook.com, Content-Type: text/html; charset=utf-8

how 2010-09-27 14:53:12

@how: Save the output to a file and open it in a UTF-aware editor, or use a hex editor, to see whether what mechanize returns is really UTF-8.

AndiDog 2010-09-27 15:07:47
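The failing byte in the traceback is itself a hint: a gzip stream starts with the two bytes 0x1f 0x8b, and 0x1f is a valid single byte in UTF-8, so decoding fails at position 1 on 0x8b. A minimal sketch of the "look at the raw bytes" check, with hypothetical helper names (`looks_gzipped`, `hex_preview`) that are not part of any library:

```python
import binascii

# Every gzip stream begins with these two magic bytes.
GZIP_MAGIC = b'\x1f\x8b'


def looks_gzipped(data):
    """Return True if the raw bytes start with the gzip magic number."""
    return data[:2] == GZIP_MAGIC


def hex_preview(data, n=16):
    """Return the first n bytes as a hex string, like a tiny hex editor."""
    return binascii.hexlify(data[:n]).decode('ascii')


if __name__ == '__main__':
    raw = b'\x1f\x8b\x08\x00'  # first bytes of a typical gzip stream
    print(hex_preview(raw), looks_gzipped(raw))
```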

Sorry, it was gzipped. Forgot to ungzip :)

how 2010-09-27 15:19:37

Answer 2

A:

It was gzipped, so it has to be decompressed before decoding:

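import gzip

def ungzipResponse(r, b):
    # Decompress the response in place if the server sent it gzip-encoded.
    headers = r.info()
    # .get() avoids a KeyError when there is no Content-Encoding header
    if headers.get('Content-Encoding') == 'gzip':
        gz = gzip.GzipFile(fileobj=r, mode='rb')
        html = gz.read()
        gz.close()
        # assumes the page is UTF-8, as the server's header said here
        headers["Content-type"] = "text/html; charset=utf-8"
        r.set_data(html)
        b.set_response(r)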

response = browser.open(url)
ungzipResponse(response, browser)
html = response.read()

how 2010-09-27 15:19:03
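For anyone who wants to try the same idea without mechanize, here is a self-contained sketch that works on raw bytes. `decode_body` and `gzip_bytes` are hypothetical helpers written for this example, and the default charset of utf-8 is an assumption taken from the header quoted in the question.

```python
import gzip
import io


def decode_body(raw, content_encoding=None, charset='utf-8'):
    """Undo gzip Content-Encoding if present, then decode bytes to text."""
    if (content_encoding or '').strip().lower() == 'gzip':
        raw = gzip.GzipFile(fileobj=io.BytesIO(raw), mode='rb').read()
    return raw.decode(charset)


def gzip_bytes(data):
    """Compress bytes in memory; only used to build test input."""
    buf = io.BytesIO()
    gz = gzip.GzipFile(fileobj=buf, mode='wb')
    gz.write(data)
    gz.close()
    return buf.getvalue()


if __name__ == '__main__':
    body = u'<p>caf\xe9</p>'.encode('utf-8')
    print(decode_body(gzip_bytes(body), 'gzip') == body.decode('utf-8'))
```

The order matters: decompress first, decode second. Calling `.decode('utf-8')` on the compressed bytes is what produces the UnicodeDecodeError from the question.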

Depending on what you're trying to do, you should really use an existing library that handles low-level stuff like this for you.

AndiDog 2010-09-27 15:20:24

I tried to find an easier way, but this seems to be the simplest solution.

how 2010-09-27 15:39:18

Answer 3

A:

You need to declare the encoding, like this:

#!/usr/bin/python
# -*- coding: iso-8859-15 -*-

mechanize needs it.

For more information, see http://www.python.org/dev/peps/pep-0263/

Gunslinger_ 2010-10-03 12:30:53

It was already there: # -*- coding: utf-8 -*-

how 2010-10-03 13:11:19

But you need # -*- coding: iso-8859-15 -*-, not # -*- coding: utf-8 -*-

Gunslinger_ 2010-10-03 15:18:28
