Hi there. I have been using BeautifulSoup, but as I understand it, that library is no longer being maintained. So what should I use? I have heard about XPath, but what else is there?
Well, if you're not duty-bound to Python, you could always use the TagSoup parser. It's a Java library, but it gives very good results. You could also just run your input through Tidy to clean it up before trying to parse it.
There was a bugfix release in April, so I'm not sure where you got the idea that it's no longer being maintained. Even if that were true, BeautifulSoup is still plenty functional, and I don't see the current implementation breaking down anytime soon. You might start having problems with HTML 5 in the next 2 years (although it has far fewer quirks, so it's easier to parse, at least so far), but there's no particular reason not to use BeautifulSoup. The community is still active with support on the Google group, and the source code is available for you to enhance as you require.
I would steer clear of lxml; it's too fussy for my taste. I'd try html5lib if I were you. It not only parses HTML, but also deals robustly with the sort of errors you see in the tag soup known as invalid HTML.
It even has a BeautifulSoup emulation mode, producing a parse tree in the Beautiful Soup form to ease porting old code across:
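import html5lib
from html5lib import treebuilders

# Requires an html5lib release that still ships the "beautifulsoup" tree builder.
parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("beautifulsoup"))

# The result is a Beautiful Soup style tree, so existing BeautifulSoup code can use it.
with open("mydocument.html") as f:
    soup = parser.parse(f)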
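If you don't need the Beautiful Soup tree form, here is a minimal sketch of the "deals robustly with invalid HTML" point, using html5lib's default ElementTree output. It assumes html5lib is installed. The markup string and the variable names are made up for illustration. The `.//p` style paths are the limited XPath subset that ElementTree supports, not full XPath.

```python
import html5lib

# Made-up, deliberately sloppy markup: no <html> or <body>, unclosed <p> and <a> tags.
BROKEN_HTML = "<title>Demo</title><p>First<p>Second <a href='/x'>link"

# namespaceHTMLElements=False keeps tag names plain ("p" instead of a namespaced name).
root = html5lib.parse(BROKEN_HTML, treebuilder="etree", namespaceHTMLElements=False)

# html5lib repairs the markup the way a browser would, so the tree can be queried normally.
paragraphs = root.findall(".//p")
link = root.find(".//a")

print(len(paragraphs), link.get("href"), link.text)
```

The parser supplies the missing html, head and body elements and closes the open tags itself, so the query code never has to special-case the broken input.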