When I call unicode() on a BeautifulSoup object, which encoding does it convert to Unicode from? Does it automatically use soup.originalEncoding?

from BeautifulSoup import BeautifulSoup
doc = "<html><h1>Heading</h1><p>Text"
soup = BeautifulSoup(doc)
print unicode(soup)

Thanks

+1  A: 

unicode() is a Python builtin, not part of BeautifulSoup. The Python documentation describes it as follows:

unicode([object[, encoding[, errors]]])

If encoding and/or errors are given, unicode() will decode the object which can either be an 8-bit string or a character buffer using the codec for encoding. The encoding parameter is a string giving the name of an encoding; if the encoding is not known, LookupError is raised. Error handling is done according to errors; this specifies the treatment of characters which are invalid in the input encoding. If errors is 'strict' (the default), a ValueError is raised on errors, while a value of 'ignore' causes errors to be silently ignored, and a value of 'replace' causes the official Unicode replacement character, U+FFFD, to be used to replace input characters which cannot be decoded. See also the codecs module.

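If you don't specify the encoding, an 8-bit string is decoded with the default encoding, i.e. the one returned by sys.getdefaultencoding() (normally 'ascii' in Python 2), in 'strict' mode.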
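To make the three error modes concrete, here is a minimal sketch. It uses the `.decode()` method rather than `unicode()` so that it also runs on Python 3; on Python 2, `unicode(raw, "utf-8")` should give the same result as `raw.decode("utf-8")`. The sample bytes and variable names are just illustrative assumptions.

```python
# UTF-8 encoded bytes for "café" (illustrative sample data)
raw = b"caf\xc3\xa9"

# Decoding with the correct codec yields the expected text.
decoded = raw.decode("utf-8")

# 'strict' is the default: bytes that are invalid for the codec raise
# UnicodeDecodeError, which is a subclass of ValueError.
try:
    raw.decode("ascii")
    strict_failed = False
except ValueError:
    strict_failed = True

# 'ignore' silently drops the undecodable bytes.
ignored = raw.decode("ascii", "ignore")

# 'replace' substitutes U+FFFD for each undecodable byte.
replaced = raw.decode("ascii", "replace")
```

With the ASCII codec, the two non-ASCII bytes are what trigger the error, get dropped, or get replaced, depending on the mode.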

Tim Pietzcker
Types can override `unicode()` by implementing the special [`__unicode__()`](http://docs.python.org/reference/datamodel.html#object.__unicode__) method. If a type implements this method, the `unicode()` builtin simply returns the result of this method, and a type can basically return whatever it wants from this method. So the result of the call in the OP in fact depends on the implementation of the `BeautifulSoup` class.
lunaryorn
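A small sketch of what that override looks like. The `Document` class here is a hypothetical stand-in, not BeautifulSoup's actual code; it only shows that a type is free to pick its own encoding inside `__unicode__()`. The method is called directly so the snippet also runs on Python 3, where the `unicode()` builtin no longer exists.

```python
class Document(object):
    """Hypothetical class that remembers the encoding of its raw bytes."""

    def __init__(self, raw, encoding):
        self.raw = raw
        self.encoding = encoding

    def __unicode__(self):
        # The type decides how to produce text; here it decodes with the
        # encoding it stored, not with sys.getdefaultencoding().
        return self.raw.decode(self.encoding)


doc = Document(b"caf\xe9", "latin-1")

# On Python 2, unicode(doc) returns the result of this method.
text = doc.__unicode__()
```

So whether the default encoding matters at all depends on what the class does in that method.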
Good point. Does anyone know whether BeautifulSoup overrides the builtin function?
Tim Pietzcker