ansaurus

Question

BeautifulSoup gives me unicode+html symbols, rather than straight up unicode. Is this a bug or misunderstanding?

Answer 1

+3 A:

That's one seriously messed up page, encoding-wise :-)

There's nothing really wrong with your approach at all. I would probably tend to do the conversion before passing it to BeautifulSoup, just because I'm persnickety:

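# Python 2; the import below assumes BeautifulSoup 3
import urllib
from BeautifulSoup import BeautifulSoup
html = urllib.urlopen('http://www.coopamerica.org/programs/responsibleshopper/company.cfm?id=271').read()
# Decode the raw bytes explicitly before BeautifulSoup sees them
h = html.decode('iso-8859-1')
soup = BeautifulSoup(h)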

In this case, the page's meta tag is lying about the encoding: the page is actually UTF-8. Firefox's page info reveals the real encoding, and you can also see this charset in the response headers returned by the server:

curl -i http://www.coopamerica.org/programs/responsibleshopper/company.cfm?id=271
HTTP/1.1 200 OK
Connection: close
Date: Tue, 10 Mar 2009 13:14:29 GMT
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET
Set-Cookie: COMPANYID=271;path=/
Content-Language: en-US
Content-Type: text/html; charset=UTF-8

If you do the decode using 'utf-8', it will work for you (or, at least, it did for me):

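# Python 2; the import below assumes BeautifulSoup 3
import urllib
from BeautifulSoup import BeautifulSoup
html = urllib.urlopen('http://www.coopamerica.org/programs/responsibleshopper/company.cfm?id=271').read()
# Decode with the charset from the response headers, not the meta tag
h = html.decode('utf-8')
soup = BeautifulSoup(h)
# Calling a tag is shorthand for findAll: every <p> inside <body>
ps = soup.body("p")
p = ps[52]
print p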
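
For readers on Python 3, the same idea (trust the charset in the Content-Type header rather than the meta tag) might be sketched roughly as follows. This is only an illustration, not the code I ran. The helper names `charset_from_content_type` and `decode_body` are hypothetical, the sample bytes are made up rather than fetched from the page, and falling back to UTF-8 when the header names no charset is an assumption.

```python
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Return the charset named in a Content-Type header value, or `default`."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset lowercased, or None if absent.
    return msg.get_content_charset() or default


def decode_body(raw_bytes, header_value):
    """Decode a response body using the charset the server declared."""
    return raw_bytes.decode(charset_from_content_type(header_value))


# Header value as shown in the curl output; the body is sample bytes only.
header = "text/html; charset=UTF-8"
body = b"Oxfam International\xe2\x80\x99s report"
text = decode_body(body, header)
```

The resulting `text` is a proper unicode string and can be handed to BeautifulSoup in the same way as `h` above.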

Jarret Hardie 2009-03-10 13:15:41

Thank you so much for the informative and gentle response. It does indeed work for me, too.

2009-03-10 13:20:37

Answer 2

+2 A:

It's actually UTF-8 misencoded as CP1252:

>>> print u'Oxfam International\xe2€™s report entitled \xe2€œOffside!'.encode('cp1252').decode('utf8')
Oxfam International’s report entitled “Offside!
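
The same round trip can be wrapped in a small helper. Below is a minimal Python 3 sketch, assuming the damage is exactly "UTF-8 bytes decoded as CP1252". The name `fix_mojibake` is hypothetical, and returning the input unchanged when the round trip fails is a design choice of this sketch, not something the page requires.

```python
def fix_mojibake(text):
    """Undo 'UTF-8 read as CP1252' damage; return text unchanged if it does not apply."""
    try:
        # Recover the original bytes, then decode them as what they really were.
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in CP1252, or not valid UTF-8 afterwards: leave as is.
        return text


# The garbled string from the example above, written with escapes.
garbled = "Oxfam International\u00e2\u20ac\u2122s report entitled \u00e2\u20ac\u0153Offside!"
fixed = fix_mojibake(garbled)
```

One caveat: Python's cp1252 codec leaves a few bytes (such as 0x9D) unmapped, so text whose original UTF-8 contained those bytes may not round-trip this way.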

Ignacio Vazquez-Abrams 2009-03-10 13:21:42
