Is there a good html parser like HtmlAgilityPack (.NET) for Python?

+1 Q:

I'm looking for a good HTML parser like HtmlAgilityPack (an open-source .NET project: http://www.codeplex.com/htmlagilitypack), but for use with Python.

Does anyone know of one?

+6 A:

Use Beautiful Soup like everyone does.

Geo 2009-08-03 13:00:12

Beautiful Soup has been superseded by lxml at this point.

Wahnfrieden 2009-08-03 20:35:18

Beautiful Soup is worth looking at. It is an HTML/XML parser that can deal with invalid pages and lets you, for example, iterate over specific tags.
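To make "iterate over specific tags" concrete, here is a minimal sketch. It assumes the current `beautifulsoup4` package (imported as `bs4`) rather than the BeautifulSoup 3 release that was current when this was posted; the sample markup and the `extract_links` helper are made up for illustration.

```python
from bs4 import BeautifulSoup  # assumption: the modern `beautifulsoup4` package

# Deliberately sloppy markup: unclosed <p> and <b> tags, no closing </html>.
BROKEN_HTML = """
<html><body>
<p>First <a href="/one">link one</a>
<p>Second <b>bold <a href="/two">link two</a>
</body>
"""

def extract_links(markup):
    """Return (href, text) pairs for every <a> tag that carries an href."""
    # "html.parser" selects the parser backend from the standard library.
    soup = BeautifulSoup(markup, "html.parser")
    return [(a["href"], a.get_text(strip=True))
            for a in soup.find_all("a", href=True)]

links = extract_links(BROKEN_HTML)
for href, text in links:
    print(href, text)
```

`find_all` walks the whole tree, so the unclosed tags around the links do not get in the way.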

dmeister 2009-08-03 13:02:05

lxml handles invalid pages better than Beautiful Soup, and iterating over tags is easier with the CSS selectors in lxml.html.

Wahnfrieden 2009-08-03 20:35:53

+3 A:

Others have recommended BeautifulSoup, but it's much better to use lxml. Despite its name, it also parses and scrapes HTML. It's much, much faster than BeautifulSoup, and it even handles "broken" HTML better than BeautifulSoup does (which is BeautifulSoup's claim to fame). It also has a compatibility API for BeautifulSoup if you don't want to learn the lxml API.

Ian Bicking agrees.

There's no reason to use BeautifulSoup anymore, unless you're on Google App Engine or a similar platform where anything that isn't pure Python isn't allowed.
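As a rough illustration of the "broken" HTML handling, here is a small sketch using `lxml.html.fromstring`. It assumes the third-party `lxml` package is installed; the sample markup and the `paragraph_texts` helper are invented for this example.

```python
import lxml.html  # assumption: the third-party `lxml` package is installed

# Neither <p> is closed, and the <b> is left open as well.
BROKEN_HTML = "<div><p>First paragraph<p>Second <b>paragraph</div>"

def paragraph_texts(markup):
    """Parse possibly malformed HTML and return the text of each <p> element."""
    root = lxml.html.fromstring(markup)
    # text_content() joins the text of an element and all of its descendants.
    return [p.text_content().strip() for p in root.iter("p")]

texts = paragraph_texts(BROKEN_HTML)
print(texts)
```

The parser closes the first paragraph when the second one opens, so the two paragraphs come back as separate elements.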

Wahnfrieden 2009-08-03 15:31:44

I have heard good things about lxml. People should try both and then make a choice.

Geo 2009-08-03 18:48:36

I don't see a compelling reason to use BeautifulSoup, so I suggest going straight to lxml. Another useful feature is that you can use CSS selectors with lxml, which really simplifies things if you already know the selector syntax.

Wahnfrieden 2009-08-03 20:29:46
