At my current company, we have this decade-old... let's call it a "Hello World" application.

While wanting to create a newer version of it, we also want to preserve older entries. These older entries contain hideous Word-generated HTML which was never filtered before.

If and when we move to a newer system, I'd prefer to have that HTML cleaned and filtered in order to have the site comply with HTML standards as much as possible.
However, just cleaning that code like Jeff Atwood described in his blog or in any other way I know of would also ruin the style and formatting.

Now, that just might cause our users to revolt and then all hell will break loose - not a very good idea.

So the question is: can Word's HTML be cleaned while preserving basic formatting (e.g. colors, italics, bold text and so on)?

Preferably using publicly available code or a library such as HTML Tidy. Examples in C# would be much appreciated.

+1  A: 

Do you have a budget for it? This might work. Try before you buy.

scope_creep
@scope_creep: Thanks, but I'm looking for a solution I can run locally, for batches of thousands of files.
GeReV
+1  A: 

Take a look at FCKEditor. It's a JavaScript-based editor, so looking at the source might give you lots of hints as to what to look for when removing Word HTML.

In particular, take a look at the file /editor/dialog/fck_paste.html. There's a function, "CleanWord", that does it all. I've modified it for use in my own applications (slight modifications, i.e. different replacements, etc.), and it does a great job of getting rid of ugly Word HTML.

It works by find-and-replace with regular expressions, which means you can easily extract the regexes and port them to the programming language of your choice to run the batch job.
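To give an idea of what such a regex pass can look like, here is a minimal Python sketch. The patterns are my own illustrative guesses at common Word debris (comments, namespaced tags such as `<o:p>`, `Mso*` classes, `mso-*` style declarations); they are not the rules from FCKEditor's CleanWord, and the function name `clean_word_html` is made up for this example. Regexes cannot tell markup from text, so expect to tune them against real samples.

```python
import re

# Illustrative patterns only; NOT the actual FCKEditor CleanWord rules.
_COMMENTS = re.compile(r"<!--.*?-->", re.S)            # incl. <!--[if gte mso 9]>...<![endif]-->
_COND_MARKERS = re.compile(r"<!\[[^\]]*\]>")           # <![if !supportLists]> and <![endif]>
_XML_DECL = re.compile(r"<\?xml[^>]*>", re.I)          # <?xml:namespace ... />
_NS_TAGS = re.compile(r"</?[a-z][a-z0-9]*:[^>]*>", re.I)   # <o:p>, </o:p>, <st1:place>, ...
_MSO_CLASS = re.compile(r"""\s+class=(["']?)Mso\w*\1""", re.I)
_STYLE_ATTR = re.compile(r"""\s+style=(["'])(.*?)\1""", re.I | re.S)


def _filter_style(match):
    """Drop mso-* declarations from one style attribute, keep the rest."""
    quote, body = match.group(1), match.group(2)
    kept = [
        decl.strip()
        for decl in body.split(";")
        if decl.strip() and not decl.strip().lower().startswith("mso-")
    ]
    if not kept:
        return ""  # nothing left, so remove the attribute entirely
    return f" style={quote}{'; '.join(kept)}{quote}"


def clean_word_html(html):
    """Strip Word-specific debris while leaving tags like <b>, <i> and colors alone."""
    html = _COMMENTS.sub("", html)
    html = _COND_MARKERS.sub("", html)
    html = _XML_DECL.sub("", html)
    html = _NS_TAGS.sub("", html)
    html = _MSO_CLASS.sub("", html)
    html = _STYLE_ATTR.sub(_filter_style, html)
    return html
```

Because nothing outside those patterns is touched, `<b>`, `<i>` and `color:` declarations survive; the trade-off is that any Word markup the patterns do not anticipate survives too.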

Anton
In my experience with the newer CKEditor, the "Paste from Word" function just opens a standard text box which omits all formatting. Is FCKEditor different in that regard?
GeReV
FCKEditor is an older version. They changed the name to CKEditor because the "FCK" made it look like the F-word (the creator is Brazilian, so he didn't realize this).
Anton
+2  A: 

tidy works fine for cleaning up and regularizing HTML syntax.

It's very configurable, so for a batch cleanup, it's likely the command line tool will do what you need. You don't have to program tidylib yourself.
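For a batch of thousands of files, a small wrapper around the command-line tool may be all you need. The sketch below assumes a `tidy` executable is on the PATH; `-q`, `-m`, `--word-2000` and `--bare` are documented tidy options, while the function names and the `*.htm*` glob are just choices for this example. Since `-m` rewrites files in place, run it on a copy of the data and compare a few results before trusting it with formatting.

```python
import subprocess
from pathlib import Path


def build_tidy_command(path):
    """Build a tidy invocation that edits one file in place.

    -q            suppress non-essential output
    -m            modify the input file in place
    --word-2000   strip surplus markup from Word's "Save as Web Page" output
    --bare        strip Microsoft-specific HTML, use plain spaces for &nbsp;
    """
    return ["tidy", "-q", "-m", "--word-2000", "yes", "--bare", "yes", str(path)]


def tidy_directory(root):
    """Run tidy over every .htm/.html file under root; return the files that failed.

    tidy exits with 0 (clean), 1 (warnings) or 2 (errors); only errors are
    treated as failures here.
    """
    failed = []
    for path in sorted(Path(root).rglob("*.htm*")):
        result = subprocess.run(
            build_tidy_command(path), capture_output=True, text=True
        )
        if result.returncode >= 2:
            failed.append((path, result.stderr.strip()))
    return failed
```

The same options can also go in a tidy config file if you would rather keep them out of the script.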

If you need more involved cleanup of the content, not just the syntax, some XSLT processors (xsltproc, for one) have an '--html' option: input files are parsed by the HTML parser instead of an XML parser. You can then use XSLT to transform or rearrange the content, then output with the HTML serializer.

Steven D. Majewski
+2  A: 

This SO question poses a similar problem, although there, programmatic cleanup is not required.

One of the answers mentions that Office 2007 has a Publish->Blog menu item that reportedly produces good results and is fast. You could record a Word macro that invokes this command, and then invoke the macro programmatically. You can use COM or VBScript to start Word and run the macro, or run winword.exe with the /m switch. Command line switches to winword.exe are given here.

mdma
+3  A: 

There are a couple of options available, but you can certainly use Jeff Atwood's code as a good starting point to write your own. If you do, you'll get fine-tuned control over the result. Note, though, that the results will never be 100% accurate, because all that extra MS-specific markup is there to ensure as much fidelity with the original document as possible (at least in IE, for round-tripping purposes). Still, most code out there does preserve most formatting.
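If you do roll your own, one way to get that fine-tuned control is an allowlist filter: keep only the tags and style properties you consider "basic formatting" and drop everything else. The sketch below uses Python's standard `html.parser` rather than the regex approach from Jeff Atwood's post, and the allowlists, the class name `FormattingFilter` and the function `keep_basic_formatting` are assumptions for illustration, to be adjusted to whatever your users actually rely on.

```python
from html import escape
from html.parser import HTMLParser

# Assumed allowlists; extend them to match the formatting you need to keep.
ALLOWED_TAGS = {"p", "br", "b", "strong", "i", "em", "u", "span", "font", "ul", "ol", "li"}
VOID_TAGS = {"br"}
ALLOWED_STYLE_PROPS = {"color", "background-color", "font-weight", "font-style", "text-decoration"}
DROP_WITH_CONTENT = {"script", "style", "title", "xml"}


def filter_style(value):
    """Keep only allowlisted CSS declarations from a style attribute value."""
    kept = []
    for decl in value.split(";"):
        name, _, val = decl.partition(":")
        if name.strip().lower() in ALLOWED_STYLE_PROPS and val.strip():
            kept.append(f"{name.strip().lower()}: {val.strip()}")
    return "; ".join(kept)


class FormattingFilter(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a DROP_WITH_CONTENT element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return  # unknown tags (e.g. o:p) are dropped, their text is kept
        kept = []
        for name, value in attrs:
            if name == "style" and value:
                style = filter_style(value)
                if style:
                    kept.append(f' style="{escape(style)}"')
            elif tag == "font" and name == "color" and value:
                kept.append(f' color="{escape(value)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth and tag in ALLOWED_TAGS and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def keep_basic_formatting(html):
    parser = FormattingFilter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Comments (including Word's conditional comments) are discarded because the parser's default comment handler does nothing. This does not repair badly nested markup, so it pairs well with a tidy pass afterwards.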

Here are some code libraries that could be helpful:

If you just want batch processing (and don't care about owning a code base), the Office 2000 HTML Filter 2.0 is probably your best bet - read more about it on TechRepublic.

Otaku
+1  A: 

PSPad includes tidy, which has a "Clean Microsoft Word 2000" option. I've used it on Word documents before, and it's customizable.

McAden