Parsing HTML in Python

views:

1412

answers:

+5 Q:

Parsing HTML in Python

What's my best bet for parsing HTML if I can't use BeautifulSoup or lxml? I've got some code that uses sgmllib, but it's a bit low-level and the module is now deprecated.

I would prefer it if the parser could stomach a bit of malformed HTML, although I'm pretty sure most of the input will be pretty clean.

+1 A:

Perhaps µTidylib will meet your needs?

Nick Presta 2009-04-04 18:14:20

http://www.xmlhack.com/read.php?item=1392 http://sourceforge.net/projects/pirxx/

http://pyxml.sourceforge.net/topics/

I don't have much experience with Python, but I have used Xerces (from the Apache foundation) in the past and found it to be very useful. The learning curve isn't bad either, though I'm not coming from a Python perspective. I suggest you consider it. (The first two links I've included discuss Python interfaces to Xerces, and the last one is the first Google hit for "python xml".)

Joe 2009-04-04 18:29:55

I know you want an HTML parser, but these will be good starting places.

Joe 2009-04-04 18:31:41

+1 A:

Why can't you use BeautifulSoup? It's a simple one-file Python module, so if you can't get it "installed", just copy and paste its contents into your actual script.

deets 2009-04-04 19:30:43

Beautiful Soup has a lot of problems that haven't been fixed yet for Python 3.

Robert Elwell 2009-04-04 19:33:23

Also, it doesn't stomach malformed HTML very well.

Surya 2009-04-04 20:47:44

I thought dealing with malformed HTML was largely the point of BeautifulSoup?

andybak 2009-04-05 09:28:23

+2 A:

Python has a native HTML parser; however, the Tidy wrapper Nick suggested would probably be a solid choice as well. Tidy is a very common library (written in C, isn't it?).

Andrei Taranchenko 2009-04-04 20:00:52
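
For anyone wondering what the native parser mentioned above looks like in use, here is a minimal sketch. It assumes Python 3, where the module is html.parser (in Python 2 it was called HTMLParser). The LinkCollector class, the extract_links helper, and the sample markup are made up for illustration; only HTMLParser and its handler methods come from the standard library. The parser is event-based and does not build a tree, so it reports tags as it sees them and does not complain about the unclosed tags or the unquoted attribute in the sample.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects href values and non-empty text chunks."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)

    def handle_data(self, data):
        # Called for text between tags; skip whitespace-only chunks.
        stripped = data.strip()
        if stripped:
            self.text_parts.append(stripped)


def extract_links(html):
    """Feed a string of HTML to the parser and return (links, text chunks)."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()  # flush any buffered text at the end of the input
    return parser.links, parser.text_parts


# Deliberately sloppy input: unclosed <p> and <a>, unquoted attribute value.
sample = '<p>Unclosed paragraph <a href="http://example.com/a">first<br><a href=b.html>second</a>'
links, text = extract_links(sample)
print(links)
print(text)
```

Because the parser only emits events, any notion of nesting or repair (for example, deciding where the unclosed paragraph ends) is left to the calling code. That is the main trade-off compared with Tidy or html5lib.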

html5lib is good:
http://code.google.com/p/html5lib/

rudy 2010-06-04 11:51:24
