I'm using Beautiful Soup to scrape data. The BS documentation states that BS should always return Unicode, but I can't seem to get Unicode. Here's a code snippet:

import urllib2
from libs.BeautifulSoup import BeautifulSoup

# Fetch and parse the data
url = 'http://wiki.gnhlug.org/twiki2/bin/view/Www/PastEvents2007?skin=print.pattern'

data = urllib2.urlopen(url).read()
print 'Type of fetched HTML : %s' % type(data)

soup = BeautifulSoup(data)
print 'Encoding of souped up HTML : %s' % soup.originalEncoding

table = soup.table
print type(table.renderContents())

The original data returned from the page is a string. BS shows the original encoding as ISO-8859-1. I thought that BS automatically converted everything to Unicode so why is it that when I do this:

table = soup.table
print type(table.renderContents())

...it gives me a string object and not Unicode?

How can I get Unicode objects from BS?

I'm really, really lost with this. Any help? Thanks in advance.

A: 

originalEncoding is exactly that - the source encoding, so the fact that BS is storing everything as unicode internally won't change that value. When you walk the tree, all text nodes are unicode, all tags are in unicode, etc., unless you otherwise convert them (say by using print, str, prettify, or renderContents).

Try doing something like:

soup = BeautifulSoup(data)
print type(soup.contents[0])

Unfortunately everything else you've done up to this point has found the very few methods in BS that convert to strings.

Nick Bastin
It gave me `<class 'libs.BeautifulSoup.BeautifulSoup.Declaration'>` for `type(soup.contents[0])` and `<type 'instance'>` for `type(soup.contents[2])`
Mridang Agarwalla
I looked at the BS source code and saw that to get Unicode strings, you have to call `renderContents(None)`. This returns Unicode. I don't know why the documentation states otherwise.
Mridang Agarwalla
@mridang: yeah, I should have fed you a document to try that on - yours is well-formed and so the first few elements in `contents` are going to be metadata that create real `BeautifulSoup` objects. Either try the example in the documentation, or walk the tree for real and get tag names and text, without using the methods called out in the documentation as specifically *not* returning unicode (like `renderContents`).
Nick Bastin
A: 

As you may have noticed, renderContents returns (by default) a string encoded in UTF-8. If you really want a Unicode string representing the entire document, you can also call unicode(soup), or decode the output of renderContents/prettify, for example unicode(soup.prettify(), "utf-8").
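
To make the byte-string versus Unicode distinction concrete, here is a minimal sketch of the same round trip using only the standard library. It is written for Python 3, where `bytes` plays the role of Python 2's `str` and `str` plays the role of `unicode`. It does not call Beautiful Soup at all, and the sample text and variable names are made up for illustration.

```python
# Hypothetical sample standing in for a page served as ISO-8859-1.
raw = "café".encode("iso-8859-1")      # bytes, as read from the network

# Decoding with the source encoding gives Unicode text, which is what a
# parser is described as holding internally.
text = raw.decode("iso-8859-1")

# Rendering to an encoded byte string (UTF-8 here, mirroring the default
# described above) leaves Unicode again.
rendered = text.encode("utf-8")

# Decoding the rendered bytes with the same encoding recovers Unicode.
restored = rendered.decode("utf-8")

print(type(raw).__name__, type(text).__name__,
      type(rendered).__name__, type(restored).__name__)
```

The point of the sketch is that the rendered bytes differ from the fetched bytes even though both decode to the same text, so the decode step has to use the encoding that the rendering step produced, not the page's original encoding.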

Bruce van der Kooij