HTML parser for GAE

+2 Q:

Generally I use lxml for my HTML parsing needs, but that isn't available on Google App Engine. The obvious alternative is BeautifulSoup, but I find it chokes too easily on malformed HTML. Currently I am testing libxml2dom and have been getting better results.

Which pure Python HTML parser have you found performs best? My priority is the ability to handle bad HTML over speed.

+3 A:

From the BeautifulSoup documentation:

Version 3.1.0 of Beautiful Soup does significantly worse on real-world HTML than version 3.0.8 does

So, it might help you to use this earlier version. That is precisely what the author himself recommends.

You can pretend that Beautiful Soup version 3.1.0 was never released. Version 3.0.8 still works fine on Python 2.3 through 2.6.

Lakshman Prasad 2010-01-29 12:32:26

Thanks for that - I got better performance with 3.0.8, but it still failed to parse the webpage usefully. Also, the BS author has lost interest in developing it further, so I had better invest my time elsewhere.

Plumo 2010-02-02 01:42:16

I decided to go with html5lib

Plumo 2010-02-02 01:31:38

Did you manage to use ElementTree API from html5lib under gae?

Sergey 2010-04-26 21:21:47

Yes.

Plumo 2010-04-27 05:23:49

It took me some digging to find the workaround for the namespaceHTMLElements bug/feature...

Sergey 2010-05-02 18:23:42
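
For anyone landing here later, a rough sketch of what the namespaceHTMLElements issue looks like. By default html5lib puts HTML elements in the XHTML namespace, so plain ElementTree lookups such as find(".//p") come back empty. The comment above does not say which workaround was used; passing namespaceHTMLElements=False to html5lib.parse is assumed here. The helper name parse_html is hypothetical, and the namespaced tree below is built by hand with the standard library as a stand-in for html5lib's default output, so the lookup behaviour can be seen without html5lib installed.

```python
import xml.etree.ElementTree as ET

# Namespace that html5lib applies to HTML elements by default.
XHTML_NS = "http://www.w3.org/1999/xhtml"


def parse_html(markup, namespaced=False):
    """Hypothetical helper: parse markup into an ElementTree element.

    With namespaced=False the tags come back as plain "p", "body", etc.
    html5lib is third-party, so it is imported only when this is called.
    """
    import html5lib
    return html5lib.parse(markup, treebuilder="etree",
                          namespaceHTMLElements=namespaced)


# Hand-built stand-in for a tree parsed with the default (namespaced) setting.
root = ET.Element("{%s}html" % XHTML_NS)
body = ET.SubElement(root, "{%s}body" % XHTML_NS)
ET.SubElement(body, "{%s}p" % XHTML_NS).text = "hello"

# A plain tag name does not match namespaced elements.
plain_hit = root.find(".//p")

# The fully qualified name does match.
qualified_hit = root.find(".//{%s}p" % XHTML_NS)
```

With a tree from parse_html(markup), the plain root.find(".//p") form should work directly, which is usually what you want when porting code written against lxml or BeautifulSoup.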
