ansaurus

Question

Answer 1

A:

originalEncoding is exactly that: the source encoding. The fact that BS stores everything as unicode internally won't change that value. When you walk the tree, all text nodes are unicode, all tags are unicode, etc., unless you convert them yourself (say by using print, str, prettify, or renderContents).

Try doing something like:

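from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 on Python 2; adjust the import path to your install
soup = BeautifulSoup(data)  # data: the raw markup you already have
print type(soup.contents[0])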

Unfortunately, everything you've tried up to this point happens to use the very few methods in BS that convert to byte strings.
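The same idea, that nodes in a parsed tree are text rather than bytes as long as you don't re-encode them, can be sketched on Python 3 without BeautifulSoup. This is only a stand-in: it uses the standard library's `html.parser` instead of BS, and `NodeTypes`, `raw` and `collector` are illustrative names, not part of either library.

```python
from html.parser import HTMLParser


class NodeTypes(HTMLParser):
    """Collect the tag names and non-blank text nodes seen while parsing."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.texts = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        if data.strip():
            self.texts.append(data)


# Hypothetical input: UTF-8 bytes, as they would arrive from a file or socket.
raw = "<html><body><p>na\u00efve</p></body></html>".encode("utf-8")

collector = NodeTypes()
collector.feed(raw.decode("utf-8"))  # decode once, at the boundary
collector.close()

# Every tag name and text node is a text string, not a byte string.
for node in collector.tags + collector.texts:
    print(repr(node), type(node).__name__)
```

Byte strings only come back into the picture if you explicitly encode the result again.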

Nick Bastin 2010-07-07 07:20:51

It gave me `<class 'libs.BeautifulSoup.BeautifulSoup.Declaration'>` for `type(soup.contents[0])` and `<type 'instance'>` for `type(soup.contents[2])`

Mridang Agarwalla 2010-07-07 07:31:04

I looked at the BS source code and saw that to get Unicode strings, you have to call `renderContents(None)`. This returns Unicode. I don't know why the documentation states otherwise.

Mridang Agarwalla 2010-07-07 08:27:29

@mridang: yeah, I should have fed you a document to try that on - yours is well-formed, so the first few elements in `contents` are going to be metadata that create real `BeautifulSoup` objects. Either try the example in the documentation, or walk the tree for real and get tag names and text, without using the methods called out in the documentation as specifically *not* returning unicode (like `renderContents`).

Nick Bastin 2010-07-07 16:15:22

Answer 2

A:

As you may have noticed, renderContents returns (by default) a string encoded in UTF-8, but if you really want a Unicode string representing the entire document you can also call unicode(soup), or decode the output of renderContents/prettify, e.g. unicode(soup.prettify(), "utf-8").
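On Python 3, where the `unicode` builtin no longer exists, the decode step looks roughly like the sketch below. It does not call BeautifulSoup: `rendered_bytes` is a hypothetical stand-in for the UTF-8 output that `renderContents()` or `prettify()` would return, and `to_text` is an illustrative helper, not a library function.

```python
# Hypothetical stand-in for UTF-8 encoded markup returned by a render call.
rendered_bytes = "<p>caf\u00e9</p>".encode("utf-8")


def to_text(rendered, encoding="utf-8"):
    """Decode rendered markup bytes to text; pass text through unchanged."""
    if isinstance(rendered, bytes):
        return rendered.decode(encoding)
    return rendered


text = to_text(rendered_bytes)

# The byte string is longer than the text because the accented letter takes two bytes in UTF-8.
print(type(rendered_bytes).__name__, len(rendered_bytes))
print(type(text).__name__, len(text))
```

The encoding passed to the decode step has to match the one used when rendering; UTF-8 is assumed here because that is the default the answer describes.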

Related

How to render contents of a tag in unicode in BeautifulSoup?

Bruce van der Kooij 2010-08-10 20:53:24

BeautifulSoup doesn't give me Unicode
