Hi there. I have been using BeautifulSoup, but as I understand it, that library is no longer being maintained. So what should I use? I have heard about XPath, but what else is there?

A: 

Well, if you're not duty-bound to Python, you could always use a TagSoup parser. It's a Java library, but it gives very good results. You could also just use Tidy to clean your input before trying to parse it.

Borealid
Python is all I know, and I'm still learning it at the moment.
Peter Nielsen
+2  A: 

Try the lxml library: http://codespeak.net/lxml/

Roki
Actually, I did. BeautifulSoup seems a lot easier.
Peter Nielsen
+10  A: 

There was a bugfix release in April, so I'm not sure where you got the idea that it's no longer being maintained. Even if that were true, BeautifulSoup is still plenty functional, and I don't see the current implementation breaking down anytime soon. You might start having problems with HTML 5 in the next 2 years (although it has far fewer quirks, so it's easier to parse, at least so far), but there's no particular reason not to use BeautifulSoup. The community is still active with support on the Google group, and the source code is available for you to enhance as you require.

Nick Bastin
Cool.. thank you very much :-)
Peter Nielsen
+5  A: 

I would steer clear of lxml; it's too fussy for my taste. I'd try html5lib if I were you. It not only parses HTML, but also deals robustly with the sort of errors you see in the tag soup known as invalid HTML.

It even has a BeautifulSoup emulation mode, producing a parse tree in the Beautiful Soup form to ease porting old code across:

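import html5lib
from html5lib import treebuilders

# Build a parser that emits a Beautiful Soup-style tree.
# Note: this needs an html5lib release that still ships the "beautifulsoup" tree builder.
parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("beautifulsoup"))

# Parse the file; the result is a Beautiful Soup tree, not a minidom document.
with open("mydocument.html") as f:
    soup = parser.parse(f)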
fmark
Have you also tried lxml.html (instead of lxml.etree)? I've had good experiences with it, even with pretty bad tag soup. Also note that you can use the html5lib parser with lxml too.
Steven
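For anyone wanting to try the lxml.html suggestion above, here is a minimal sketch of what it might look like, combined with the XPath queries mentioned in the question. The sample markup and variable names are made up for illustration; it assumes lxml is installed.

```python
import lxml.html

# Hypothetical tag soup: unclosed <p> and <a> tags, no closing </body> or </html>.
html = "<html><body><p>First<p>Second <a href='/x'>link</body>"

# lxml.html.fromstring() uses a lenient HTML parser rather than a strict XML one.
doc = lxml.html.fromstring(html)

# XPath queries work on the repaired tree.
paragraphs = [p.text_content().strip() for p in doc.xpath("//p")]
links = doc.xpath("//a/@href")

print(paragraphs)
print(links)
```

The same doc object also accepts ElementTree-style calls such as doc.findall(), so XPath is optional here.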
No, I haven't, but I will now :)
fmark
I think I'll stick with BeautifulSoup
Peter Nielsen