I am trying to retrieve a web page, but a non-ASCII character is embedded in the HTML. The character is not visible when I use "View Source."

import urllib2
from BeautifulSoup import BeautifulSoup

isbn = 9780141187983
url = "http://search.barnesandnoble.com/booksearch/isbninquiry.asp?ean=%s" % isbn
opener = urllib2.build_opener()
url_opener = opener.open(url)
page = url_opener.read()
html = BeautifulSoup(page)
html  # Echoing the object in the interactive console raises the error below.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 21555: ordinal not in range(128)

I also tried...

html = BeautifulSoup(page.encode('utf-8'))

How can I read this web page into BeautifulSoup without getting this error?

+4  A: 

This error is probably happening when you try to print the representation of the BeautifulSoup object, which happens automatically if, as I suspect, you are working in the interactive console.

# This code will work fine. Note that we assign the BeautifulSoup object
# to a name, which prevents the console from printing it immediately.
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(u'\xa0')

# This will probably show the error you saw
print soup

# And this would probably be fine
print soup.encode('utf-8')
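
To see the same failure without BeautifulSoup or a network call, here is a minimal sketch using a plain unicode string. It assumes the console is implicitly encoding to ASCII, as the traceback suggests. The names text, ascii_failed, utf8_bytes, and debug_view are only illustrative and are not part of the original code.

```python
# Illustrative string; u'\xa0' is a non-breaking space, which looks like
# an ordinary space in "View Source".
text = u"price:\xa0$10"

# Encoding to ASCII fails. An ASCII-only console does this implicitly
# when it tries to display a unicode value.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Encoding explicitly to UTF-8 succeeds.
utf8_bytes = text.encode("utf-8")

# For debugging, escape non-ASCII characters instead of raising.
debug_view = text.encode("ascii", "backslashreplace")
```

The backslashreplace error handler is handy while debugging, since it makes the otherwise invisible character show up as an escape sequence rather than stopping the session with an exception.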
Triptych
This is correct. The error was encountered when trying to debug and printing the content to the screen. It is unfortunate that UTF-8 issues make debugging such a challenge, but the code does work correctly as long as I do not print.
Ryan Rosario
@Ryan - Trust me - I've been there. Glad this helped.
Triptych