ansaurus

Question

No more BeautifulSoup

Answer 1

A:

Well, if you're not duty-bound to Python, you could always use TagSoup. It's a Java library, but it gives very good results. You could also just run your input through Tidy to clean it up before trying to parse it.

Borealid 2010-07-14 08:07:39

Python is all I know, and I'm still learning it at the moment.

Peter Nielsen 2010-07-17 18:13:29

Answer 2

+2 A:

Try the lxml library: http://codespeak.net/lxml/

Roki 2010-07-14 08:08:17

Actually, I did. BeautifulSoup seems a lot easier.

Peter Nielsen 2010-07-17 18:14:01

Answer 3

+10 A:

There was a bugfix release in April, so I'm not sure where you got the idea that it's no longer being maintained. Even if that were true, BeautifulSoup is still plenty functional, and I don't see the current implementation breaking down anytime soon. You might start having problems with HTML 5 in the next 2 years (although it has far fewer quirks, so it's easier to parse, at least so far), but there's no particular reason not to use BeautifulSoup. The community is still active with support on the Google group, and the source code is available for you to enhance as you require.

Nick Bastin 2010-07-14 08:27:36

Cool.. thank you very much :-)

Peter Nielsen 2010-07-17 18:10:38

Answer 4

+5 A:

I would steer clear of lxml; it's too fussy for my taste. I'd try html5lib if I were you. It not only parses HTML, but also deals robustly with the sort of errors you see in the tag soup known as invalid HTML.

It even has a BeautifulSoup emulation mode, producing a parse tree in the Beautiful Soup form to ease porting old code across:

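import html5lib
from html5lib import treebuilders

# Build the parse tree in Beautiful Soup form instead of the default tree type
parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("beautifulsoup"))

# Binary mode lets html5lib work out the document's encoding itself
with open("mydocument.html", "rb") as f:
    soup = parser.parse(f)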

fmark 2010-07-14 08:34:09

Have you also tried lxml.html (instead of lxml.etree)? I've had good experiences with it, even with pretty bad tag soup. Also note that you can use the html5lib parser with lxml too.

Steven 2010-07-14 10:45:30
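
To illustrate the lxml.html suggestion above, here is a minimal sketch. It assumes lxml is installed, and the broken_html string is a made-up example of tag soup, not something taken from the question. It only shows that lxml.html will parse markup with unclosed tags; how well it copes with any particular real page is something to check for yourself.

```python
# Sketch only: assumes the third-party lxml package is installed.
import lxml.html

# Hypothetical tag soup: the <p> and <b> tags are never closed.
broken_html = (
    "<div><p>First paragraph<p>Second <b>bold text"
    "<a href='/next'>next page</a></div>"
)

# lxml.html.fromstring() uses a forgiving HTML parser, whereas
# lxml.etree.fromstring() expects well-formed XML and would reject this input.
root = lxml.html.fromstring(broken_html)

# Walk the repaired tree: collect paragraph text and link targets.
paragraphs = [p.text_content() for p in root.iter("p")]
links = [a.get("href") for a in root.iter("a")]

print(paragraphs)
print(links)
```

The element API here (iter, get, text_content) is lxml's own, so code written against BeautifulSoup would need porting rather than running unchanged.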

No, I haven't, but I will now :)

fmark 2010-07-14 11:24:57

I think I'll stick with BeautifulSoup

Peter Nielsen 2010-07-17 18:11:15
