I'm aggregating content from a few external sources and am finding that some of it contains errors in its HTML/DOM. A good example would be HTML missing closing tags or malformed tag attributes. Is there a way to clean up the errors in Python natively, or are there any third-party modules I could install?
A:
There are python bindings for the HTML Tidy Library Project, but automatically cleaning up broken HTML is a tough nut to crack. It's not so different from trying to automatically fix source code -- there are just too many possibilities. You'll still need to review the output and almost certainly make further fixes by hand.
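If you do want to try it, a minimal sketch using the pytidylib bindings might look like this (assuming both the module and the underlying Tidy library are installed; bad_html is just a placeholder for your input):
from tidylib import tidy_document

# Tidy returns the cleaned markup plus a report of everything it had to fix
cleaned_html, errors = tidy_document(bad_html, options={"output-xhtml": 1})
print(errors)  # review these warnings -- some problems will still need a human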
Nicholas Knight
2010-06-19 00:49:09
+3
A:
I would suggest BeautifulSoup. It has a wonderful parser that can deal with malformed tags quite gracefully. Once you've read in the entire tree, you can just output the result.
from BeautifulSoup import BeautifulSoup
tree = BeautifulSoup(bad_html)
good_html = tree.prettify()
I've used this many times and it works wonders. BeautifulSoup really shines when it comes to pulling data out of bad HTML.
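For example, pulling the cell text out of a malformed table might look something like this (the tr/td tags are just placeholders for whatever markup you are actually scraping):
from BeautifulSoup import BeautifulSoup

tree = BeautifulSoup(bad_html)
# findAll tolerates unclosed tags; adjust the tag filters to match your data
for row in tree.findAll('tr'):
    cells = [''.join(cell.findAll(text=True)) for cell in row.findAll('td')]
    print(cells)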
Hope that helps,
Will
JudoWill
2010-06-19 01:31:57
Take caution with performance; BeautifulSoup is very expensive.
Tarantula
2010-06-19 01:37:28
@Tarantula. I agree, BeautifulSoup is pretty slow, but it's the only thing I've ever seen that can parse some of those crazy malformed HTML-based tables out there.
JudoWill
2010-06-19 01:44:44
That's true JudoWill.
Tarantula
2010-06-19 02:01:19