ansaurus

Question

Getting international characters from a web page?

Answer 1

A:

I haven't tried it myself, but have you tried

http://zesty.ca/python/scrape.html ?

It seems to have a method htmldecode(text) which would do what you want.

Nick Fortescue 2008-09-10 00:32:23

Answer 2

A:

Try using BeautifulSoup. It should do the trick and give you a nicely formatted DOM to work with as well.

This blog entry seems to have had some success with it.

Jacob Rigby 2008-09-10 00:48:19

Answer 3

+5 A:

I would recommend BeautifulSoup for HTML scraping. You also need to tell it to convert HTML entities to the corresponding Unicode characters, like so:

>>> from BeautifulSoup import BeautifulSoup
>>> html = "<html>&#196;&#196;RITALO!</html>"
>>> soup = BeautifulSoup(html, convertEntities=BeautifulSoup.HTML_ENTITIES)
>>> print soup.contents[0].string
ÄÄRITALO!

(It would be nice if the standard codecs module included a codec for this, such that you could do "some_string".decode('html_entities') but unfortunately it doesn't!)

EDIT: Another solution: Python developer Fredrik Lundh (author of ElementTree, among other things) has a function on his website to unescape HTML entities. It works with decimal, hex and named entities (BeautifulSoup will not work with the hex ones).

dF 2008-09-10 00:50:19
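As a rough illustration of the entity unescaping described in this answer, here is a minimal sketch that assumes Python 3.4 or later, where the standard library provides html.unescape. This is not the BeautifulSoup call shown above and not Fredrik Lundh's function; the helper name decode_entities and the sample strings are made up for the example.

```python
import html


def decode_entities(text):
    """Convert decimal, hex, and named HTML entities to Unicode characters.

    Hypothetical helper: a thin wrapper around the standard library's
    html.unescape (assumes Python 3.4+).
    """
    return html.unescape(text)


# Illustrative inputs: the same text written with three kinds of references.
samples = [
    "&#196;&#196;RITALO!",   # decimal character references
    "&#xC4;&#xC4;RITALO!",   # hexadecimal character references
    "&Auml;&Auml;RITALO!",   # named entities
]

# All three forms should decode to the same Unicode string.
decoded = [decode_entities(s) for s in samples]
```

Calling print(decoded[0]) on a Unicode-capable terminal should show the same text as the BeautifulSoup example above.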

Replace `print soup.contents[0].string` with `print str(soup.contents[0].string)`. Otherwise it doesn't work in a non-Unicode environment.

J.F. Sebastian 2008-09-10 02:38:00
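The concern in the comment above, that printing decoded text can fail where the output does not support Unicode, can also be handled by encoding explicitly before output. The sketch below assumes Python 3 and uses a different technique from the str() call suggested in the comment: the built-in xmlcharrefreplace error handler, which turns characters the target encoding cannot represent into numeric references. The helper name to_ascii_safe is hypothetical.

```python
def to_ascii_safe(text):
    """Return text restricted to ASCII characters.

    Hypothetical helper: characters outside ASCII are replaced with
    decimal character references by the built-in "xmlcharrefreplace"
    error handler, so the result can be written to an ASCII-only output.
    """
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# "\u00c4" is the character Ä; the escaped result contains only ASCII.
safe = to_ascii_safe("\u00c4\u00c4RITALO!")
```

The trade-off is that the reader sees references rather than the characters themselves, so this is more suited to logs or ASCII-only channels than to display.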
