I'm looking for a good HTML parser like HtmlAgilityPack (an open-source .NET project: http://www.codeplex.com/htmlagilitypack), but for use with Python.

Does anyone know of one?

+6  A: 

Use Beautiful Soup like everyone does.

Geo
BS has been superseded by lxml at this point.
Wahnfrieden
A: 

Beautiful Soup is what you should search for. It is an HTML/XML parser that can deal with invalid pages and lets you, for example, iterate over specific tags.
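As a rough sketch of what iterating over specific tags looks like, here is a small example. It assumes the current `bs4` package (Beautiful Soup 4) and Python's built-in `html.parser` backend; the markup and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical, deliberately sloppy markup: the <p> tags are never closed
# and the closing </html> tag is missing.
html = "<html><body><p>First<p>Second <a href='/one'>one</a><a href='/two'>two</a></body>"

# "html.parser" is the pure-Python backend from the standard library.
soup = BeautifulSoup(html, "html.parser")

# Iterate over every <a> tag and collect its href attribute.
links = [a["href"] for a in soup.find_all("a")]
print(links)
```

`find_all` takes a tag name and returns the matching elements in document order, so the same pattern works for any other tag.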

dmeister
lxml handles invalid pages better than BS, and it's easier to iterate over tags with CSS selectors in lxml.html.
Wahnfrieden
+3  A: 

Others have recommended BeautifulSoup, but it's much better to use lxml. Despite its name, it is also for parsing and scraping HTML. It's much, much faster than BeautifulSoup, and it even handles "broken" HTML better than BeautifulSoup (which is BeautifulSoup's claim to fame). It also has a compatibility API for BeautifulSoup if you don't want to learn the lxml API.

Ian Bicking agrees.

There's no reason to use BeautifulSoup anymore, unless you're on Google App Engine or something where anything not purely Python isn't allowed.
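For comparison, a minimal sketch of the lxml approach, assuming the `lxml` package is installed; the broken markup and the variable names are invented for the example.

```python
import lxml.html

# Hypothetical broken fragment: unclosed <p> and <b> tags.
broken = "<div><p>First<p>Second <a href='/one'>one</a> <b>bold <a href='/two'>two</a></div>"

# fromstring() repairs the markup while parsing and returns an element tree.
doc = lxml.html.fromstring(broken)

# Collect the href of every link with an XPath query.
hrefs = doc.xpath("//a/@href")
print(hrefs)

# Iterate over specific tags and read their text.
for p in doc.iter("p"):
    print(p.text_content())
```

Elements from `lxml.html` also offer a `cssselect()` method for CSS selectors (for example `doc.cssselect("a")`); in recent lxml releases this relies on the separately installed `cssselect` package, so XPath is used here to keep the sketch dependency-free.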

Wahnfrieden
I have heard good things about lxml. People should try both and then make a choice.
Geo
I don't see a compelling reason to use BeautifulSoup, so I suggest going straight to lxml. Another useful thing is that you can use CSS selectors with lxml, which really simplifies things and feels familiar.
Wahnfrieden