ansaurus

Question

Answer 1

+1 A:

Python 3 is much stricter about the difference between bytes and (Unicode) strings. The result of urlopen(...).read(...) is an object of type bytes, and bytes.find does not accept a Unicode string (str) as the search argument. In your case, you can simply replace "pricewrap" with a bytes literal:

idx_pricewrap = b.find(b'pricewrap')

The same applies to the other .find calls. Python 2 encoded Unicode strings automatically where it (more or less) made sense, but Python 3 is stricter about this, and you need to be aware of the difference.
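To illustrate the difference, here is a minimal sketch. The `page` value is a made-up stand-in for what urlopen(...).read() would return; it is not taken from the original question.

```python
# Hypothetical stand-in for the bytes returned by urlopen(...).read()
page = b'<div class="pricewrap">42.00</div>'

# Searching a bytes object with a str argument raises TypeError in Python 3
try:
    page.find('pricewrap')
    raised = False
except TypeError:
    raised = True

# Searching with a bytes literal works and returns the match offset
idx_pricewrap = page.find(b'pricewrap')
```

The key point is that both the haystack and the needle must be of the same kind: bytes with bytes, or str with str.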

AndiDog 2010-10-25 21:01:35

Thanks very much. Before I saw your answer I found a relevant example in the docs, which I think does what you suggest in a different way. I'll answer my own question to show this.

NotSuper 2010-10-25 21:15:43

@NotSuper: Yes, decoding the website to a Unicode object is a good solution as well. Actually it's the better solution, but for HTML sites you might rather want to use a parser library that can detect the charset automatically (from the HTTP header or the charset definition inside the HTML, instead of assuming UTF-8).
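As a rough sketch of the "charset from the HTTP header" idea using only the standard library: the helper name `decode_body` and the sample data below are assumptions for illustration, not part of the original code. With a real response you could probably read the charset directly via `response.headers.get_content_charset()`, since the headers object is an email-style message.

```python
from email.message import Message

def decode_body(body, content_type, default="utf-8"):
    """Decode body bytes using the charset named in a Content-Type value.

    Falls back to `default` when the header declares no charset.
    Hypothetical helper; an unknown charset name would raise LookupError.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset() or default
    return body.decode(charset, errors="replace")

# Made-up sample: a page served as ISO-8859-15 rather than UTF-8
body = "<p class='pricewrap'>\u20ac5</p>".encode("iso-8859-15")
html = decode_body(body, "text/html; charset=ISO-8859-15")

# After decoding, ordinary str searches work
idx_pricewrap = html.find("pricewrap")
```

This only covers the HTTP header; a charset declared inside the HTML itself is what a parser library would additionally detect.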

AndiDog 2010-10-25 21:37:04

@AndiDog: I'd like to learn how to use a parser library. Could you point me to some examples? I assume I could do this with what 3.1 has?

NotSuper 2010-10-26 00:11:48

@NotSuper: There's a well-known library called [`BeautifulSoup`](http://www.crummy.com/software/BeautifulSoup/) that does just that. It has a version compatible with Python 3.1. According to the documentation, it automatically produces Unicode strings from HTML input, but I don't know how you can pass the "Content-Type" header to it in case the HTML itself doesn't declare the charset. I haven't used it myself, but there are a lot of questions about it on SO, so you can get help here.

AndiDog 2010-10-26 06:52:43

@AndiDog: Thanks. I'll look into BeautifulSoup.

NotSuper 2010-10-26 18:03:59

Answer 2

A:

I finally found a relevant example in the docs:

http://docs.python.org/py3k/library/urllib.request.html?highlight=urllib#examples

The first example gave me some understanding and led me to revise my code to

http://tutoree7.pastebin.com/sUq8s4wh

which works like a charm.
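The pastebin contents are not reproduced here, so the following is only a guess at the general approach, modeled on the decode step in the docs example: decode the bytes to a str first, then search with ordinary strings. The `raw` value is a made-up stand-in for the downloaded page, and UTF-8 is an assumption about its encoding.

```python
# Hypothetical stand-in for the bytes returned by urlopen(...).read()
raw = b'<span class="pricewrap">19.99</span>'

# Decode once, assuming the page is UTF-8 encoded
text = raw.decode('utf-8')

# From here on, plain str arguments work with .find
idx_pricewrap = text.find('pricewrap')
```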

NotSuper 2010-10-25 21:17:45


Python 3.1 code and error
