The majority of content on my company's website starts life as a Word document (Windows-1252 encoded) and is eventually copied and pasted into our UTF-8-encoded content management system. The conversion usually chokes on a few characters (special break characters, smart quotes, scientific notation), which have to be cleaned up manually, but of course a few always slip through.

What would be the best way to detect these?

+1  A: 

Can you save the text as .rtf and then parse it using some other program?

Can you use Word's VBA to save the text as something sane?

Dana Robinson
Unfortunately, training the end users has not been very successful. Copying and pasting from Word into Notepad and then into the CMS resolves all the issues, but users are reluctant to follow this cumbersome step. I'm trying to find a solution that can resolve this server-side.
Chris Pebble
Can you install Word on the CMS server? If so, you might be able to use COM interop to convert the text into something the CMS will handle.
Dana Robinson
+1  A: 

As already mentioned, it would be best to export the Word contents to a parsable format (either RTF or XML would do).

There might be a specific reason for using copy-and-paste to add the material to your CMS, but with copying and pasting you will probably always end up with some kind of visual check-and-fix round, unless you create a tool that monitors the clipboard.

When you copy from a recent version of Word, the clipboard holds the content in several different formats, one of which is XML-based. It would be possible to create something that cleans up the Word XML on the clipboard and sets the text version (the one you probably paste into the CMS) to the cleaned-up result.

You could use the Word interop assemblies that come with Office and the standard C# clipboard functions to create this. The tool could run in the background alongside Word while content is being added to the CMS.

barry
Interesting, I'm taking a look at implementing something like this and will let you know how it turns out!
Chris Pebble
+1  A: 

How exactly are you doing the conversion?

The whole copying-from-Word problem is something I've come across often, but it should really be easy to solve.

Those characters you mention are all in the 0x80-0x9F range, where the Windows-1252 code page differs from the ISO-8859-1 code page. ISO-8859-1 defines no printable characters in that range.
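
To make that range concrete, here is a minimal Python sketch (Python is only a stand-in here, the CMS itself is presumably C#) that lists which bytes between 0x80 and 0x9F Windows-1252 assigns to printable characters. The helper name `cp1252_specials` is made up for illustration, and the result reflects Python's own `cp1252` codec, which leaves five of the 32 bytes unassigned.

```python
import unicodedata


def cp1252_specials():
    """Map each byte in 0x80-0x9F to the character Windows-1252 assigns to it."""
    table = {}
    for byte in range(0x80, 0xA0):
        try:
            table[byte] = bytes([byte]).decode("cp1252")
        except UnicodeDecodeError:
            # Python's cp1252 codec leaves 0x81, 0x8D, 0x8F, 0x90 and 0x9D unassigned
            continue
    return table


specials = cp1252_specials()
for byte, char in specials.items():
    # Print the Unicode name instead of the character to keep the output ASCII-only
    print(f"0x{byte:02X} -> U+{ord(char):04X} {unicodedata.name(char)}")
```

The smart quotes (0x91 to 0x94), the dashes (0x96, 0x97), the ellipsis (0x85) and the trademark sign (0x99) all show up in this list, which matches the kind of characters that typically break when pasting from Word.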

You must be doing the conversion from ISO-8859-1 (or perhaps ISO-8859-15) instead of Windows-1252, causing it to choke on characters in that range.

You should either adjust the source encoding of your conversion or, if that's somehow not possible (I'm not familiar with C#, but I doubt it), use the code page chart to fix the 32 problem code points (0x80-0x9F) separately from the main conversion.
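
Both options can be sketched in a few lines of Python, again only as an illustration and not as the CMS's actual code. The function names `to_utf8`, `find_suspects` and `repair` are hypothetical. The sketch assumes that text which was wrongly decoded as ISO-8859-1 now contains code points U+0080 to U+009F, which is also a cheap server-side way to detect the characters that slipped through.

```python
import re

# C1 control code points U+0080-U+009F. In pasted text they almost always mean
# that Windows-1252 bytes were decoded as ISO-8859-1.
C1_RANGE = re.compile("[\u0080-\u009f]")


def to_utf8(raw):
    """Option 1: convert with the correct source encoding from the start."""
    return raw.decode("cp1252").encode("utf-8")


def find_suspects(text):
    """Detection: return (index, code point) for every C1 character in text."""
    return [(m.start(), f"U+{ord(m.group()):04X}") for m in C1_RANGE.finditer(text)]


def _fix_char(match):
    byte = bytes([ord(match.group())])
    # The five bytes Windows-1252 leaves unassigned come out as U+FFFD
    return byte.decode("cp1252", errors="replace")


def repair(text):
    """Option 2: remap only the problem code points after the main conversion."""
    return C1_RANGE.sub(_fix_char, text)


# Simulate the faulty conversion: Windows-1252 bytes decoded as ISO-8859-1
garbled = b"\x93smart\x94 quotes \x96 5\x99".decode("latin-1")
suspects = find_suspects(garbled)
fixed = repair(garbled)
print(suspects)
print(ascii(fixed))
```

Fixing the source encoding (option 1) is the cleaner route. The remapping in `repair` is only worth it when the conversion step itself cannot be changed.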

mercator