views: 98
answers: 2
I'm aggregating content from a few external sources and am finding that some of it contains errors in its HTML/DOM. A good example would be HTML missing closing tags or malformed tag attributes. Is there a way to clean up these errors natively in Python, or are there any third-party modules I could install?

A: 

There are Python bindings for the HTML Tidy Library Project, but automatically cleaning up broken HTML is a tough nut to crack. It's not so different from trying to automatically fix source code -- there are just too many possibilities. You'll still need to review the output and will almost certainly have to make further fixes by hand.
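For example, a minimal sketch using the pytidylib binding (one of the available bindings; this assumes it and the underlying Tidy library are installed, and that bad_html holds the markup you want to repair):

from tidylib import tidy_document

# Returns the repaired markup plus a report of everything Tidy had to fix
good_html, errors = tidy_document(bad_html)
print(errors)  # worth reviewing -- Tidy can't guess the author's intent every time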

Nicholas Knight
+3  A: 

I would suggest BeautifulSoup. It has a wonderful parser that can deal with malformed tags quite gracefully. Once you've read in the entire tree, you can just output the result.

from BeautifulSoup import BeautifulSoup

# Parse the malformed markup; the parser repairs unclosed/mismatched tags as it builds the tree
tree = BeautifulSoup(bad_html)
# Re-serialize the repaired tree as nicely indented HTML
good_html = tree.prettify()

I've used this many times and it works wonders. And if you're simply pulling data out of bad HTML, that's where BeautifulSoup really shines.
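As a rough sketch of that (my own example, assuming the markup contains a table whose cell text you want, and reusing the tree parsed above):

# Hypothetical extraction example: collect the text of every cell in every row,
# even when <td>/<tr> tags are left unclosed in the source
rows = []
for row in tree.findAll('tr'):
    # findAll(text=True) gathers the text nodes under each cell
    rows.append([''.join(cell.findAll(text=True)).strip() for cell in row.findAll('td')])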

Hope that helps,

Will

JudoWill
Take caution with performance, though: BeautifulSoup is very expensive.
Tarantula
@Tarantula. I agree, BeautifulSoup is pretty slow, but it's the only thing I've ever seen that can parse some of those crazy malformed HTML-based tables out there.
JudoWill
That's true, JudoWill.
Tarantula