I'm aggregating content from a few external sources and am finding that some of it contains errors in its HTML/DOM. A good example would be HTML missing closing tags or malformed tag attributes. Is there a way to clean up the errors in Python natively, or are there any third-party modules I could install?
A:
There are python bindings for the HTML Tidy Library Project, but automatically cleaning up broken HTML is a tough nut to crack. It's not so different from trying to automatically fix source code -- there are just too many possibilities. You'll still need to review the output and almost certainly make further fixes by hand.
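If you do want to try it, a minimal sketch using the pytidylib bindings might look like this (assuming both the module and the underlying Tidy library are installed; bad_html is just a placeholder for your input):
from tidylib import tidy_document

# Tidy returns the cleaned markup plus a report of everything it had to fix
cleaned_html, errors = tidy_document(bad_html, options={"output-xhtml": 1})
print(errors)  # review these warnings -- some problems will still need a human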
Nicholas Knight
2010-06-19 00:49:09
+3
A:
I would suggest BeautifulSoup. It has a wonderful parser that can deal with malformed tags quite gracefully. Once you've read in the entire tree, you can just output the result.
from BeautifulSoup import BeautifulSoup
tree = BeautifulSoup(bad_html)
good_html = tree.prettify()
I've used this many times and it works wonders. BeautifulSoup really shines when it comes to pulling data out of bad HTML.
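For example, pulling the cell text out of a malformed table might look something like this (the tr/td tags are just placeholders for whatever markup you are actually scraping):
from BeautifulSoup import BeautifulSoup

tree = BeautifulSoup(bad_html)
# findAll tolerates unclosed tags; adjust the tag filters to match your data
for row in tree.findAll('tr'):
    cells = [''.join(cell.findAll(text=True)) for cell in row.findAll('td')]
    print(cells)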
Hope that helps,
Will
JudoWill
2010-06-19 01:31:57
Take caution with performance; BeautifulSoup is very expensive.
Tarantula
2010-06-19 01:37:28
@Tarantula. I agree, BeautifulSoup is pretty slow, but it's the only thing I've ever seen that can parse some of those crazy malformed HTML-based tables out there.
JudoWill
2010-06-19 01:44:44
That's true JudoWill.
Tarantula
2010-06-19 02:01:19