Some days ago I received a rather lengthy and somewhat elaborate MS Word document, which I was asked to convert to HTML for uploading to a 3rd party’s website. My first instinct was to save the Word document as HTML and use Dreamweaver’s "Clean Up Word HTML" command. Not only did I have to leave it running all night for Dreamweaver to finish "cleaning", but the results were far from desirable in my opinion. There were still a lot of leftover inline styles, etc. that Dreamweaver just plain missed.

I approached it differently this morning and just selected the entire document in Word, copied it, and then pasted it into Dreamweaver’s Design window. Not only was it much, much faster, but the output code was much, much cleaner! I didn’t have to run the "Clean Up Word HTML" command afterwards either.

Now I never convert a Word file straight to HTML, for standards reasons. Instead I cut and paste content between Word and Dreamweaver. Happily, that gives me the following.

  1. If a Word heading is in the Heading 1 Style, it will become an H1 in Dreamweaver (following the Dreamweaver stylesheet). Similarly Heading 2 becomes H2, Heading 3 becomes H3 and so forth.

    If the Word author wasn't that organized, you can use a shortcut like Control+1 (or Command+1 on a Mac) to convert any line to an H1. Can you guess the shortcut for H2? Yes, it's Control+2 (or Command+2 on a Mac).

  2. Paragraphs now cut and paste as paragraphs (with the P tag). If you don't want an HTML paragraph at that point, use Control+0 (or Command+0 on a Mac) to remove it in Dreamweaver.

  3. A new one I discovered is that some embedded images in Word may be transferred to your Dreamweaver site as "clip" images when you copy and paste from Word. So, if you have a Word file with embedded images, you may be able to extract them fairly quickly via Dreamweaver.

I also found this free tool useful: http://www.textfixer.com/html/convert-word-to-html.php. It works much like Dreamweaver's Design view, so it is handy for people who don't have Dreamweaver.

But doesn't the code we get depend on how properly the MS Word document is formatted?

Does Word 2007 also have styles that correspond to HTML elements?

For example headings, tables, ordered and unordered lists, bold, italic, hyperlinks, etc.?

How should Word 2007 be used semantically:

  • To get the most semantic HTML possible from the "Save as HTML" option?

  • To get the cleanest possible code when copying into Dreamweaver's Design view?

  • To get the cleanest possible code when pasting into the browser-based WYSIWYG HTML
    editor that comes with every CMS?

Does anyone know any tips, tricks, tutorials, articles or advice for formatting MS Word documents semantically?

Or is there a better way than mine?

A: 

There is no dependable way to clean up Word docs and make them into nice HTML. If the document has any special characters, they are often encoded in a Windows character set instead of UTF-8, so they just "break" when displayed online. The list goes on. You often end up with silliness like:

<strong>hello</strong><strong>th<strong>er</strong>e</strong><i></i>
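To illustrate the character-set problem mentioned above, here is a minimal Python sketch. It assumes the bytes that came out of Word are Windows-1252 (an assumption; the real encoding depends on the system that produced the file), and the function names are made up for this example. It shows two possible fixes: re-encoding as UTF-8, or replacing non-ASCII characters with numeric entities.

```python
def word_bytes_to_utf8(raw: bytes) -> bytes:
    """Decode bytes ASSUMED to be Windows-1252 and re-encode them as UTF-8."""
    text = raw.decode("cp1252")
    return text.encode("utf-8")


def to_numeric_entities(text: str) -> str:
    """Replace every non-ASCII character with a numeric character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# Curly quotes, an en dash, a curly apostrophe and an accented e,
# as they would appear in a Windows-1252 byte stream.
sample = b"\x93Hello\x94 \x96 it\x92s caf\xe9"

fixed = word_bytes_to_utf8(sample).decode("utf-8")
print(to_numeric_entities(fixed))
# prints: &#8220;Hello&#8221; &#8211; it&#8217;s caf&#233;
```

If the input is actually in a different encoding, the decode step will silently produce the wrong characters, so it is worth spot-checking the output.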

The only dependable method is to paste it into Notepad and mark it up manually. You can write a few macros to do things like insert <p></p> at paragraph breaks, but that's about it.
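The kind of macro described above could look roughly like this in Python. This is only a sketch: it assumes each non-empty line of the pasted plain text is one paragraph (which is how Word paragraphs usually arrive in Notepad), and `paragraphs_to_html` is a hypothetical name.

```python
import html


def paragraphs_to_html(plain_text: str) -> str:
    """Wrap each non-empty line in <p>...</p>, escaping &, < and >.

    Assumption: one line of plain text equals one paragraph.
    """
    paragraphs = []
    for line in plain_text.splitlines():
        # Collapse runs of spaces and tabs that were used for visual layout.
        text = " ".join(line.split())
        if text:
            paragraphs.append("<p>" + html.escape(text, quote=False) + "</p>")
    return "\n".join(paragraphs)


print(paragraphs_to_html("First line\r\n\r\nTom & Jerry\r\n"))
# prints:
# <p>First line</p>
# <p>Tom &amp; Jerry</p>
```

Headings, lists and emphasis still have to be marked up by hand, which is the point of the answer.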

If there is a huge volume of material that needs to go online from Word, you may be better off using a PDF.

Diodeus
Dreamweaver easily converts special characters to entities, so this is not a problem.
metal-gear-solid
A: 

Have you tried this? Word Cleaner

nimbupani
Yes, but it's only free for files under 20 KB: "File size is greater than 20 Kb. Please consider subscribing." http://textism.com/wordcleaner/users/signup
metal-gear-solid
+3  A: 
  • HTML Tidy has options for this: word-2000, bare and clean.

  • FCKEditor and similar try to clean up code pasted from Word.

  • There's (rather old now) demoroniser.
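As a rough sketch of how the three HTML Tidy options above might be passed from a script: the snippet below assumes a `tidy` executable is on the PATH, and the file names and function names are placeholders invented for this example.

```python
import shutil
import subprocess


def build_tidy_command(source_path: str, output_path: str) -> list:
    """Build a tidy command line that enables word-2000, bare and clean."""
    return [
        "tidy",
        "--word-2000", "yes",  # strip the extra markup Word adds on "Save as HTML"
        "--bare", "yes",       # replace smart quotes, dashes and similar with plain ASCII
        "--clean", "yes",      # replace presentational tags with style rules
        "-o", output_path,
        source_path,
    ]


def run_tidy(source_path: str, output_path: str) -> int:
    """Run tidy and return its exit status (0 ok, 1 warnings, 2 errors)."""
    if shutil.which("tidy") is None:
        raise RuntimeError("tidy executable not found on PATH")
    result = subprocess.run(build_tidy_command(source_path, output_path))
    return result.returncode


# Example (placeholder file names): run_tidy("from-word.html", "cleaned.html")
```

Word exports almost always trigger warnings, so an exit status of 1 is to be expected and does not mean the run failed.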

However, don't expect miracles. It's unlikely that a Word document will have decent structure (it theoretically could, but no Word user bothers with this). These programs can't add semantic information if it's not there.

As for semantic editing in Word: use styles. It supports headings properly (sadly not much else). You can check that in outline view.

You don't need, and shouldn't use, spaces or line breaks for indentation or space adjustment. Word has the ability to explicitly control paragraph spacing and indentation.
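Outside Word, one way to see whether a document really uses heading styles is to look inside the file. The sketch below assumes a .docx file (the Word 2007+ format, not the older binary .doc) and uses only the Python standard library. `list_paragraph_styles` is a hypothetical helper, and style IDs such as "Heading1" are what English-language Word typically writes; other language versions may differ.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used inside word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def list_paragraph_styles(docx_file):
    """Return a list of (style_id, text) pairs, one per paragraph.

    style_id is None when the paragraph has no explicit style, which
    usually means it is plain body text.
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    results = []
    for para in root.iter("{%s}p" % W_NS):
        style = para.find("{%s}pPr/{%s}pStyle" % (W_NS, W_NS))
        style_id = style.get("{%s}val" % W_NS) if style is not None else None
        text = "".join(t.text or "" for t in para.iter("{%s}t" % W_NS))
        results.append((style_id, text))
    return results


# Example (placeholder file name):
# for style_id, text in list_paragraph_styles("report.docx"):
#     print(style_id, "|", text[:60])
```

A document where every paragraph comes back with `None`, or where headings show up as body text with manual bold, will not convert to semantic HTML no matter which tool is used.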

porneL
+1  A: 

I've found that the OpenOffice.org HTML generator (open the .doc in OpenOffice and save as HTML) works better than the one in MS Office.

It's still not perfect, but gives MUCH cleaner HTML that's much more sane to look at.

Sean Madden