Generally I use lxml for my HTML parsing needs, but that isn't available on Google App Engine. The obvious alternative is BeautifulSoup, but I find it chokes too easily on malformed HTML. Currently I am testing libxml2dom and have been getting better results.

Which pure Python HTML parser have you found performs best? My priority is the ability to handle bad HTML over speed.

+3  A: 

From the BeautifulSoup documentation:

"Version 3.1.0 of Beautiful Soup does significantly worse on real-world HTML than version 3.0.8 does."

So it might help to use the earlier version, 3.0.8. That is precisely what the author himself recommends:

"You can pretend that Beautiful Soup version 3.1.0 was never released. Version 3.0.8 still works fine on Python 2.3 through 2.6."

Lakshman Prasad
Thanks for that. I got better results with 3.0.8, but it still failed to parse the webpage usefully. Also, the BS author has lost interest in developing it further, so I had better invest my time elsewhere.
Plumo
A: 

I decided to go with html5lib.

Plumo
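
For readers who want to try html5lib on malformed markup, here is a minimal sketch. It assumes html5lib is installed and uses its default ElementTree tree builder; the sample markup and the helper name paragraph_texts are made up for illustration and do not come from the thread.

```python
try:
    import html5lib  # third-party package, pure Python
except ImportError:  # lets the sketch load even where html5lib is absent
    html5lib = None

# By default html5lib puts HTML elements in the XHTML namespace.
XHTML = "{http://www.w3.org/1999/xhtml}"

# Illustrative broken input: unclosed <p> and <b>, no closing </html>.
BROKEN = "<html><body><p>first<p>second<b>bold</body>"


def paragraph_texts(markup):
    """Parse markup the way a browser would and return the leading text of each <p>."""
    root = html5lib.parse(markup)  # returns the root <html> element
    return [p.text for p in root.iter(XHTML + "p")]


if html5lib is not None:
    print(paragraph_texts(BROKEN))
```

The parser closes the first paragraph implicitly when the second one opens, so both paragraphs come back as separate elements even though the input never closes them.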
A: 

Did you manage to use the ElementTree API from html5lib under GAE?

Sergey
Yes.
Plumo
It took me some digging to find the workaround for the namespaceHTMLElements bug/feature...
Sergey
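
The thread does not spell out the namespaceHTMLElements workaround, so the following is only a guess at what is meant: by default html5lib puts every element in the XHTML namespace, which makes plain ElementTree queries such as findall(".//p") come back empty. Passing namespaceHTMLElements=False to html5lib.parse turns that off. The helper name root_tag_and_p_count and the sample markup are illustrative.

```python
try:
    import html5lib  # third-party package, pure Python
except ImportError:  # lets the sketch load even where html5lib is absent
    html5lib = None

BROKEN = "<p>one<p>two"  # illustrative input with unclosed paragraphs


def root_tag_and_p_count(markup, namespaced):
    """Return the root tag and how many <p> elements a plain, un-namespaced query finds."""
    root = html5lib.parse(
        markup,
        treebuilder="etree",
        namespaceHTMLElements=namespaced,
    )
    return root.tag, len(root.findall(".//p"))


if html5lib is not None:
    # Default behaviour: tags look like "{http://www.w3.org/1999/xhtml}p",
    # so the plain ".//p" query matches nothing.
    print(root_tag_and_p_count(BROKEN, namespaced=True))
    # With namespacing disabled, the same query finds both paragraphs.
    print(root_tag_and_p_count(BROKEN, namespaced=False))
```

If the namespaced tree is preferred, the alternative is to keep the default and prefix every tag in a query with "{http://www.w3.org/1999/xhtml}".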