Please provide the single best option you are aware of.

+7  A: 

HTML Tidy does a very good job on Word 2000 HTML, but I'm not sure how well it works on newer Word output.

owenmarshall
+2  A: 

Dreamweaver has a "clean up word HTML" option. Granted, I know it isn't perfect, but it is also the only thing I have worked with.

Scott S.
I actually decided to go ahead and use this option, since Dreamweaver also allows you to easily edit things as needed. However, I did say free, so I can't really accept this one, sorry.
jeremcc
+1  A: 

In my opinion? Don't use it.

But in the real world, I've found that FCKEditor does a decent job of cleaning up Word's fantastically hideous HTML.

foxxtrot
A: 

How about Bersoft Word HTML CleanUp?

Geoffrey Chetwood
A: 

The single best option I am aware of? Don't use Word HTML.

Chris Upchurch
I'm getting them from the users and don't want to have to retype it all.
jeremcc
Any way you can get them to give it to you in some other format? Heck, even if you can get them to give you the actual Word file, you could probably find some sort of converter that produces better HTML than Word does.
Chris Upchurch
+3  A: 

Rather than cleaning up Word's HTML, you could generate HTML directly from the Word document using AbiWord. (wv is now deprecated in favour of AbiWord; it's basically been absorbed.)

An example:

    AbiWord --to=html file.doc --exp-props="html4: yes"

See more in the documentation.
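If you have a batch of files, the command above can be driven from a script. This is only a rough sketch: the helper names are made up for illustration, the binary name `AbiWord` is copied from the example (on your system it may be spelled differently), and the flags are exactly the ones shown above and nothing more.

```python
import shutil
import subprocess


def abiword_html_command(doc_path, binary="AbiWord"):
    """Build the argument list for the AbiWord command shown above.

    `binary` is an assumption: use whatever name the executable has on your PATH.
    """
    return [binary, "--to=html", str(doc_path), "--exp-props=html4: yes"]


def convert_to_html(doc_path, binary="AbiWord"):
    """Run the conversion, failing early if the binary cannot be found."""
    if shutil.which(binary) is None:
        raise FileNotFoundError(f"{binary} not found on PATH")
    # check=True raises CalledProcessError if AbiWord exits with a non-zero status.
    subprocess.run(abiword_html_command(doc_path, binary), check=True)
```

Passing the arguments as a list avoids shell quoting problems with the space inside `html4: yes`.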

Porges
+10  A: 

Word 2007 has a "publish > blog" menu item on the Office menu (top left corner).

Using this feature seems to do an incredibly good job of cleaning the HTML, far better than any of the other HTML exporters built into Word (like "save as HTML Filtered").

I have actually set up a bogus free blog somewhere just to use this HTML-cleaning capability. Most long articles on Joel on Software originated in Word 2007 and were published to a fake blog just to clean up the HTML.

Edit: as pointed out in the comments, be sure that you enter a title for the fake post. If you don't, Word will show a generic error: "Can't publish your post".

Joel Spolsky
This worked pretty well, but in my case it formatted the HTML slightly differently from the Word doc, and since it was coming from other people, I didn't want to actually alter the formatting.
jeremcc
This worked great for me. I created a Live Spaces account. Be careful not to accidentally select the link at the top of the blog that links you back to Live Spaces. Also, when creating the blog entry, you need to enter a title; otherwise it will give you a useless error telling you it can't be posted.
Simon_Weaver
In Visual Studio, look for a tag like <div id="msgcns!128FB3A5A1708E5A!123" class="bvMsg"> and click between the '<' and the 'd' in 'div'. At the bottom, where you have a tree of elements, click on the last div and, in the popup menu, click 'select tag contents'. You can then cut and paste this HTML.
Simon_Weaver
A: 

I have only used vim and it worked just fine.

/Allan

Allan Wind
+1  A: 

Not exclusive to Word documents, but it is free (and open source). You might try Tidy: http://www.w3.org/People/Raggett/tidy/

+3  A: 

C# solution from Jeff Atwood.
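For anyone not working in C#, the same general idea (a handful of regular expressions that strip Word-specific markup) can be sketched in Python. To be clear, this is not Jeff Atwood's code: the function name and the pattern list are illustrative assumptions, they cover only a few common kinds of Word clutter, and regexes are not a real HTML parser, so check the output before trusting it.

```python
import re

# Illustrative patterns only (an assumption, not an exhaustive list of Word's markup).
_WORD_CLUTTER = [
    # Conditional comments such as <!--[if gte mso 9]> ... <![endif]-->
    re.compile(r"<!--\[if.*?<!\[endif\]-->", re.S | re.I),
    # Any remaining HTML comments
    re.compile(r"<!--.*?-->", re.S),
    # Office namespace tags such as <o:p> and </o:p>
    re.compile(r"</?(?:o|v|w|st1):[^>]*>", re.I),
    # class attributes that use Word's Mso* class names
    re.compile(r"\s+class=[\"']?Mso\w*[\"']?", re.I),
    # Quoted inline style attributes
    re.compile(r"\s+style=(?:\"[^\"]*\"|'[^']*')", re.I),
    # lang attributes such as lang=EN-US
    re.compile(r"\s+lang=[\"']?[\w-]+[\"']?", re.I),
]


def clean_word_html(html):
    """Apply each pattern in order, deleting whatever it matches."""
    for pattern in _WORD_CLUTTER:
        html = pattern.sub("", html)
    return html


sample = (
    '<p class=MsoNormal style="margin:0in"><span lang=EN-US>Hello'
    "<o:p></o:p></span></p><!--[if gte mso 9]><xml>junk</xml><![endif]-->"
)
print(clean_word_html(sample))
```

The order matters a little: the conditional comments are removed first so that the generic comment pattern does not stop at an inner `-->` and leave half of a block behind.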

Pavel Chuchuva
+2  A: 

Try Wordoff, an online tool: http://wordoff.org/

Derek
This one works great. Strips out the spurious attributes too.
Sherri
+1  A: 

There is a tool I wrote a while back: a web application for converting Word DOC files to HTML. You just upload the .doc file and you get an interactive view of the conversion with a bunch of different options to tweak it. It's up here if you want to give it a try:

http://www.manglebracket.com/

darkporter
Looks cool. I'll try it sometime.
jeremcc
MangleBracket was offline for a while. It's back up now.
darkporter
+1  A: 

I found this tool: Opilion Software DocToHtml (http://opilsoft.com/doctohtml.html). It does a very good job and produced the smallest HTML of all the utilities that I tried.

GW83
A: 

Some online solutions:

Word HTML Cleaner:

A free tool that removes the excessive tags and clutter from Microsoft Word generated HTML documents, leaving basic formatting intact. File sizes are greatly reduced, and the returned HTML is easier to read, revise and employ.

Word2cleanhtml:

This site cleans up the HTML for any pasted document. It applies a number of filters to fix various things that Microsoft Office puts in its HTML, and gives you a nicely formatted result that you can paste directly into a web page or content editing system.

Textism: Word HTML Cleaner:

A tool to strip Microsoft’s proprietary tags and other superfluous noise from Word-generated HTML documents, leaving all the basic goodness intact. The service is free of charge for documents up to 20Kb in size. For larger files, subscriptions are available, proceeds from which go to keep your host’s dogs in biscuits and squeaky toys.

Mehper C. Palavuzlar
A: 

There is a solution for this on Coding Horror: Cleaning Word's Nasty HTML.

Mehper C. Palavuzlar