I have quite a big document in HTML format that was generated by Microsoft Word. It is very messy and full of bloat (unknown tags, unknown namespaces, and so on).

Is there any way to convert it into plain HTML syntax?

Thanks!

+1  A: 

This isn't really a programming question, but (at least recent versions of) Word can save to "Web Page, Filtered", which removes Office-specific tags and properties and only leaves the tags necessary for the document to be rendered in a web browser. So, if you have Word, you could try using it to open the HTML document and save it in that format.

CyberShadow
+2  A: 

Try HTML Tidy. I hear it works quite well on HTML generated by MS Word (definitely at least up to Word 2000, but probably on more recent versions too).

David Zaslavsky
+2  A: 

You're probably looking for HTML Tidy, which has adapters in pretty much every language out there. It has options to clean up Microsoft Word HTML output (and many other features).
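If you'd rather script it than run Tidy by hand, something like the following might work. It is only a rough sketch: it assumes the `tidy` command-line tool is installed and on your PATH, and the helper names `build_tidy_command` and `clean_word_html` are made up for this example. The options used (`word-2000`, `bare`, `clean`, `drop-proprietary-attributes`) are Tidy configuration options, but check `tidy -help-config` for your version, since behaviour differs between releases.

```python
import shutil
import subprocess


def build_tidy_command(src, dst, tidy="tidy"):
    """Build an argv list asking HTML Tidy to clean up Word-generated HTML.

    The function name and option selection are illustrative assumptions,
    not a canonical recipe.
    """
    return [
        tidy,
        "-q",                                    # suppress non-essential output
        "--word-2000", "yes",                    # strip surplus Word 2000 markup
        "--bare", "yes",                         # strip Microsoft-specific HTML, plain spaces for nbsp
        "--clean", "yes",                        # replace presentational tags with style rules
        "--drop-proprietary-attributes", "yes",  # drop vendor-specific attributes
        "-o", dst,                               # write the cleaned document here
        src,
    ]


def clean_word_html(src, dst):
    """Run Tidy on src and write the result to dst; return Tidy's exit code."""
    if shutil.which("tidy") is None:
        raise FileNotFoundError("the 'tidy' executable was not found on PATH")
    result = subprocess.run(
        build_tidy_command(src, dst), capture_output=True, text=True
    )
    # Tidy exits with 0 on success, 1 if there were warnings, 2 on errors.
    if result.returncode >= 2:
        raise RuntimeError(result.stderr)
    return result.returncode


# Example call (file names are placeholders):
# clean_word_html("word_export.html", "cleaned.html")
```

An exit code of 1 only means Tidy printed warnings, which is common with Word output, so the sketch treats only 2 and above as failure.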

cletus