I want to scrape some information off a football (soccer) web page using simple Python regexps. The problem is that players such as the first chap, ÄÄRITALO, come out as &#196;&#196;RITALO!
That is, the HTML uses escaped markup for the special characters, such as &#196;

Is there a simple way of reading the html into the correct python string? If it was XML/XHTML it would be easy, the parser would do it.

A: 

I haven't tried it myself, but have you tried

http://zesty.ca/python/scrape.html ?

It seems to have a method htmldecode(text) which would do what you want.

Nick Fortescue
A: 

Try using BeautifulSoup. It should do the trick and give you a nicely formatted DOM to work with as well.

This blog entry seems to have had some success with it.

Jacob Rigby
+5  A: 

I would recommend BeautifulSoup for HTML scraping. You also need to tell it to convert HTML entities to the corresponding Unicode characters, like so:

>>> from BeautifulSoup import BeautifulSoup    
>>> html = "<html>&#196;&#196;RITALO!</html>"
>>> soup = BeautifulSoup(html, convertEntities=BeautifulSoup.HTML_ENTITIES)
>>> print soup.contents[0].string
ÄÄRITALO!

(It would be nice if the standard codecs module included a codec for this, such that you could do "some_string".decode('html_entities') but unfortunately it doesn't!)

EDIT: Another solution: Python developer Fredrik Lundh (author of ElementTree, among other things) has a function to unescape HTML entities on his website, which works with decimal, hex and named entities (BeautifulSoup will not work with the hex ones).
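
For readers on Python 3.4 or later, here is a minimal sketch of the same kind of entity decoding using the standard library's `html.unescape` rather than the function mentioned above. The sample string and the variable names `raw` and `decoded` are only illustrative assumptions, not taken from the page being scraped.

```python
import html

# Illustrative input mixing a decimal (&#196;), a hex (&#xC4;)
# and a named (&Auml;) character reference.
raw = "&#196;&#xC4;RITALO! &Auml;"

# html.unescape converts all three kinds of reference to Unicode characters.
decoded = html.unescape(raw)

print(decoded)
```

Because the result is an ordinary `str`, regular expressions can then be run over `decoded` directly.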

dF
Replace `print soup.contents[0].string` with `print str(soup.contents[0].string)`. Otherwise it doesn't work in a non-Unicode environment.
J.F. Sebastian