ansaurus

Question

Answer 1

+1 A:

If requestHandler.read() delivers a UTF-8 encoded stream, then

pageData = requestHandler.read().decode('utf-8')

will decode this into a Unicode string, at which point (as Dietrich Epp correctly noted) the unicode() call is no longer necessary.

If it raises an exception (a UnicodeDecodeError), then the input is not valid UTF-8.
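A minimal sketch of that check, written for Python 3 (an assumption; the thread itself uses Python 2, where the same .decode call applies to byte strings). The helper name decode_page and the sample byte strings are hypothetical stand-ins for whatever requestHandler.read() returns:

```python
def decode_page(raw_bytes, encoding="utf-8"):
    """Return the decoded text, or None if raw_bytes is not valid in `encoding`."""
    try:
        return raw_bytes.decode(encoding)
    except UnicodeDecodeError:
        # The bytes are not valid in the requested encoding.
        return None


# Hypothetical sample data: a six-letter Cyrillic word encoded two ways.
utf8_bytes = b"\xd0\xbf\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82"  # UTF-8
cp1251_bytes = b"\xef\xf0\xe8\xe2\xe5\xf2"                        # Windows-1251

decoded_ok = decode_page(utf8_bytes)     # a six-character str
decoded_bad = decode_page(cp1251_bytes)  # None: not valid UTF-8
```

Returning None is only one way to signal the failure; letting the exception propagate is equally reasonable if the caller should decide what to do.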

Tim Pietzcker 2010-05-14 14:09:23

At which point the call to `unicode` can be removed, as it's redundant.

Dietrich Epp 2010-05-14 14:12:33

Yeah, there is an error. I tried both encode and decode and got the same error.

Ockonal 2010-05-14 14:12:34

There is no error while decoding the page data, only near the Beautiful Soup code.

Ockonal 2010-05-14 14:19:14

Could you update your question and show the actual error then?

Tim Pietzcker 2010-05-14 14:27:52

Answer 2

+2 A:

In your first snippet, the call unicode(requestHandler.read()) tells Python to convert the bytestring returned by read into unicode: since no codec is specified for the conversion, ASCII gets tried (and fails). It never gets to the point where you're going to call .decode (which would make no sense to call on that unicode object anyway).

Either use unicode(requestHandler.read(), 'utf-8'), or requestHandler.read().decode('utf-8'): either of these should produce a correct unicode object if the encoding is indeed utf-8 (the presence of that D0 byte suggests it may not be, but it's impossible to guess from being shown a single non-ascii character out of context).
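The difference can be sketched in Python 3 (an assumption; there str plays the role of Python 2's unicode, and the implicit ASCII step has to be spelled out to reproduce the failure). The byte string raw is a hypothetical stand-in for the result of requestHandler.read():

```python
# Hypothetical stand-in for requestHandler.read(): UTF-8 bytes of a Cyrillic word.
raw = b"\xd0\x9f\xd1\x80\xd0\xb8\xd0\xb2\xd0\xb5\xd1\x82"

# Decoding as ASCII, which is what Python 2's unicode(raw) attempts when no
# codec is given, fails on the first byte outside the 0x00-0x7F range.
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError as exc:
    ascii_failed = True
    failing_byte = raw[exc.start]  # the byte the ASCII codec rejected

# The two explicit forms are equivalent and both succeed on valid UTF-8.
text_a = str(raw, "utf-8")    # Python 3 counterpart of unicode(raw, 'utf-8')
text_b = raw.decode("utf-8")
```

Here failing_byte is 0xD0, the same byte value mentioned above, and both explicit forms yield the same six-character string.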

Printing Unicode data is a different issue and requires a well-configured and cooperative terminal emulator -- one that lets Python set sys.stdout.encoding on startup. For example, on a Mac, using Apple's Terminal.app:

>>> import sys
>>> sys.stdout.encoding
'UTF-8'

so the printing of Unicode objects works fine here:

>>> print u'\xabutf8\xbb'
«utf8»

as does the printing of utf8-encoded byte strings:

>>> print u'\xabutf8\xbb'.encode('utf8')
«utf8»

but on other machines only the latter will work (using the terminal emulator's own encoding, which you need to discover on your own because the terminal emulator isn't telling Python;-).
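One way to make the "encode it yourself" route explicit is sketched below for Python 3 (an assumption, as is the helper name encode_for_terminal). It uses the encoding Python detected for stdout when there is one and otherwise falls back to a guess, replacing characters the target encoding cannot represent instead of raising:

```python
import sys


def encode_for_terminal(text, encoding=None):
    """Encode text for output, replacing characters the encoding cannot represent."""
    # Preference order: the caller's choice, then what Python detected for
    # stdout, then UTF-8 as a last-resort guess.
    target = encoding or getattr(sys.stdout, "encoding", None) or "utf-8"
    return text.encode(target, errors="replace")


utf8_out = encode_for_terminal(u"\xabutf8\xbb", "utf-8")   # b'\xc2\xabutf8\xc2\xbb'
ascii_out = encode_for_terminal(u"\xabutf8\xbb", "ascii")  # b'?utf8?'
```

The UTF-8 fallback is only a guess; if the terminal actually uses another encoding, it has to be passed in explicitly, since Python cannot discover it on its own.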

Alex Martelli 2010-05-14 14:17:28

Yeah, now the script doesn't fail during page decoding, but in Beautiful Soup I get the same error.

Ockonal 2010-05-14 14:24:19

@Ockonal, BeautifulSoup 3.0.8 works fine with unicode data (avoid 3.1.0!). I bet your problem is with the `print`ing, not with BeautifulSoup itself.

Alex Martelli 2010-05-14 14:30:01

Yeah, damn =) I was printing to track down my mistake, and it was near the page-data decoding. Thanks, without the printing this works well.

Ockonal 2010-05-14 14:32:57


Trouble with encoding and urllib
