I thought BeautifulSoup could do that, but it does not seem to do the trick.

What method have you already used, and is it reliable in the long term?

+4  A: 

You could use the lxml library, specifically lxml.html, which parses your HTML into an element tree that you can then serialize as XML, for example with lxml.etree.tostring().

If this fails because your HTML is too broken, you can use ElementSoup (a parser built on BeautifulSoup) to build an lxml.html tree instead.

ikanobori
+2  A: 

You can try uTidylib (http://utidylib.berlios.de/), a Python wrapper for the Tidy library. Tidy works well in most cases.

For something more robust (or at least more browser-like), I guess you could try WebKit or Gecko. I'm not sure whether their Python wrappers expose the parts responsible for cleaning HTML, but it's worth a look.

Scharron