This might have been asked in another way. I am not doing it on the fly, however. Once in a while we get pieces of content in Word files that have em dashes, bold and italic text, and block quotes. Is there a good tool to convert this into clean HTML?

Otherwise, what other approaches do people take?

A: 

Word is very "dirty" with its own coding. It can have nested bold tags, empty bold tags and all kinds of nastiness, depending on whether the user used the built-in styles (Heading 1, Heading 2 etc.) or changed font sizes by hand. Anything that takes the Word doc and tries to "convert" it to HTML will inherit the same markup problems as well.

The best thing to do is record a macro in Word to perform multiple search-and-replace actions on obvious things, such as em dashes, tabs, ellipses etc.

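Then replace double paragraph breaks (^p^p) with a placeholder (like ~), then replace all single breaks (^p) with a space, then replace ~ with </p>^p<p> to generate HTML paragraphs. You will still need to add the opening <p> at the very start and the closing </p> at the very end.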
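The same search-and-replace idea can also be scripted outside Word. The following is only a minimal sketch, assuming the document has already been saved as plain text with a blank line between paragraphs; the function name `text_to_paragraphs` and the `ENTITIES` table are made up for illustration and are not part of the macro described above.

```python
import html
import re

# Typographic characters mapped to named HTML entities.
# This table is an assumption; extend it with whatever your documents contain.
ENTITIES = {
    "\u2014": "&mdash;",   # em dash
    "\u2013": "&ndash;",   # en dash
    "\u2026": "&hellip;",  # ellipsis
    "\u2018": "&lsquo;",
    "\u2019": "&rsquo;",
    "\u201c": "&ldquo;",
    "\u201d": "&rdquo;",
}


def text_to_paragraphs(text: str) -> str:
    """Wrap blank-line-separated blocks of plain text in <p> tags."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    paragraphs = []
    for block in re.split(r"\n\s*\n", text):
        # Tabs and single line breaks inside a paragraph collapse to one space.
        block = " ".join(block.split())
        if not block:
            continue
        # Escape &, < and > first, then add entities so they are not re-escaped.
        block = html.escape(block, quote=False)
        for char, entity in ENTITIES.items():
            block = block.replace(char, entity)
        paragraphs.append(f"<p>{block}</p>")
    return "\n".join(paragraphs)
```

Calling `text_to_paragraphs(open("article.txt", encoding="utf-8").read())` (the file name is hypothetical) gives one `<p>` element per paragraph. Bold, italics and block quotes are lost in plain text, so those still have to be marked up by hand.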

Then copy the entire document, paste it into Notepad to remove any non-ASCII markup, then copy and paste that into your HTML editor, and manually mark up the 10% that's left over, like bold, italics, mismatched paragraph tags etc.

Nothing will ever be as good as hand-coding, so with this technique most of the grunt work is done, and you have clean text to start from.

Diodeus
Also, you should be able to automate the "paste into Notepad" part by calling GetText on the Clipboard object with the appropriate type.
Binary Worrier
A: 

A long time ago I was tasked with taking a reasonably well-structured multi-megabyte Word document and converting it into a series of HTML pages (about 20,000 of them!). This was accomplished by saving the Word doc as RTF (Word's Save As HTML output was much too "dirty") and converting the RTF to HTML via a Perl script. The conversion was a two-pass process: first clean up common formatting errors, then convert the cleaned RTF to HTML.

Since the document editors continued to maintain the Word document, it paid to codify common formatting errors in the first pass, because the errors often recurred even after being fixed.

Incidentally, this process showed a very skeptical management how in just 40 hours (or so) a good coder could produce ~20,000 web pages and keep them up to date indefinitely, while the original authors (whose time was even more valuable) would have spent multiple hundreds of hours doing the conversion and would have been forced to maintain the resulting HTML by hand thereafter.

Chris Nava
A: 

Convert to RTF and use an XSLT to convert the rich text to HTML. I would recommend trying to get everything as RTF instead of .docx or whatever Word format.

Ty
A: 

You may want to give this tool a try: OpenXML Document Viewer.

It offers a command line tool for converting OpenXML (DOCX) documents into HTML.

0xA3
A: 

If you can install Word 2003 or 2007, then you can use the new OOXML format to generate XML files. The format is pretty weir...complex but at least you can parse it with standard tools. That should allow you to extract the information you need from the file.

The file OfficeXMLMarkupExplained_en.docx contains an introduction and many details on how OOXML works.
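As a rough illustration of parsing the format with standard tools, here is a minimal sketch using only the Python standard library. It assumes a .docx file (a zip archive whose main text lives in word/document.xml) and looks only at paragraphs and the bold/italic run properties; the function names `docx_to_html` and `_is_on` are made up for this example, and tables, lists, styles and block quotes are not handled.

```python
import html
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def _is_on(run_props, tag):
    """True if a toggle such as w:b or w:i is present and not switched off."""
    if run_props is None:
        return False
    el = run_props.find(W + tag)
    if el is None:
        return False
    return el.get(W + "val", "true") not in ("0", "false", "off")


def docx_to_html(path_or_file) -> str:
    """Emit one <p> per paragraph, keeping only bold and italic formatting."""
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    out = []
    for para in root.iter(W + "p"):
        pieces = []
        for run in para.iter(W + "r"):
            text = "".join(t.text or "" for t in run.iter(W + "t"))
            if not text:
                continue  # skip empty runs, a common source of empty tags
            text = html.escape(text, quote=False)
            props = run.find(W + "rPr")
            if _is_on(props, "i"):
                text = f"<em>{text}</em>"
            if _is_on(props, "b"):
                text = f"<strong>{text}</strong>"
            pieces.append(text)
        if pieces:
            out.append("<p>" + "".join(pieces) + "</p>")
    return "\n".join(out)
```

Formatting applied through named styles rather than directly on a run is ignored here, and adjacent runs with the same formatting are not merged, so the output may still need a cleanup pass.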

Aaron Digulla
A: 

See also this SO Question: Is there an html/css normalizer that works?

Ken Gentle
+1  A: 

The easiest and fastest way for me is to copy all the text from Word and paste it into the WYSIWYG editor of Dreamweaver (any version from MX to CS3), using the Paste Special command and choosing to keep just the structure of the document. It works great if your Word document is not too complex, and if it is really complex you just need some extra editing in the code view. The resulting HTML is really clean.

The only problem with this method is that you need Dreamweaver, which is not free. Still, you can test the method with the trial version of DW.

alexmeia
+1  A: 

I am surprised no one has mentioned it, but HTML Tidy normally does a good job of this. I haven't used it recently, but I understand it's suitable for cleaning up HTML content exported from Word in particular.

Andrew Ferrier
Tried it on the current Word version and didn't get a good result at all. It may handle HTML output from older versions better.
David Burrows
+1  A: 

I wrote a tool years ago called CleanXHTML 1.2 for Microsoft Office Word 2003 (.NET 2.0). This is designed to work inside of Word and allows you to export XHTML based on what is highlighted (or selected) in the document. I've been sitting on a Word 2007 version for years.

rasx
I will try this in Word 2007
metal-gear-solid
A: 

Also try http://www.manglebracket.com/, it's a web app where you upload a Word DOC and it converts it to HTML with various (too many really) options. Perfect for ad-hoc conversion, when your copywriter sends you a press release in Word and you want to put it on the site, for example.

darkporter
A: 

You can also try this Doc to HTML Converter. It can generate very clean HTML from MS Word in batch, either through its GUI or in command-line mode.

ZWolf
A: 

I wrote a command-line utility to do this: for details, see this Doc to HTML converter.

ChrisW