This one's driving me nuts . . .

I have a bunch of MS Word files that a client wants displayed on his web site. I've converted them to HTML using "Save as Web Page" -- and yes, I know that this produces lousy HTML, but other methods I've tried lose the links to the embedded images.

For the most part, I can use PHP to clean up the display, but one item has me completely baffled: all single and double quotes are coming through as various letters with diacritics (accents), and I can't figure out how to detect them and convert them to the correct HTML entities. For example: Õ (O tilde) should be a single quote, Ò (O grave) should be an open double quote, and Ó (O acute) should be a close double quote. I've tried htmlentities, iconv and a bunch of other methods with no luck. I welcome suggestions.

Thanks

Mark

A: 

I suggest opening those lousy HTML files in an editor like Notepad++ and just doing a search and replace across all open documents.
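If there are too many files to fix by hand, the same search and replace could be scripted. Here is a rough sketch in Python, using only the three mappings given in the question; the function name `fix_quotes` and the choice of entities (`&rsquo;`, `&ldquo;`, `&rdquo;`) are assumptions for illustration, not something taken from the original files.

```python
# Mapping taken from the question: mis-displayed character -> intended HTML entity.
# Escapes are used so the script does not depend on its own file encoding.
QUOTE_FIXES = {
    "\u00d5": "&rsquo;",  # O tilde -> single quote / apostrophe (assumed right-hand)
    "\u00d2": "&ldquo;",  # O grave -> opening double quote
    "\u00d3": "&rdquo;",  # O acute -> closing double quote
}

# Build the translation table once; keys must be single characters.
_TABLE = str.maketrans(QUOTE_FIXES)


def fix_quotes(html: str) -> str:
    """Replace each mis-displayed quote character with its HTML entity."""
    return html.translate(_TABLE)
```

Note that this replaces every Õ, Ò and Ó blindly, so it is only safe if the documents never use those letters legitimately.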

Ruel
Expanding on this, Notepad++ comes by default with the TextFX plugin, which has an 'HTML Tidy -> clean Microsoft Word 2000 document' function (admittedly, I've never had to use the thing).
djn
A: 

What's the encoding of the Word document? You can either try to match the original encoding through PHP, or change the encoding of the Word document to something like UTF-8 and make sure your page is served and displayed as UTF-8 as well.
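One encoding consistent with the symptoms in the question is Mac OS Roman: there, bytes 0xD2, 0xD3 and 0xD5 are the curly quotes, and the same bytes read as Latin-1 or Windows-1252 display as Ò, Ó and Õ. That the files are actually Mac OS Roman is an assumption, so check it before converting anything. A minimal sketch in Python of re-encoding such a file's bytes to UTF-8 (the function name `to_utf8` is hypothetical):

```python
def to_utf8(raw: bytes, source_encoding: str = "mac_roman") -> bytes:
    """Decode bytes using the assumed source encoding and re-encode as UTF-8.

    source_encoding defaults to Mac OS Roman, which is a guess based on the
    symptoms described in the question, not a confirmed fact about the files.
    """
    text = raw.decode(source_encoding)
    return text.encode("utf-8")


# Example: the bytes that were showing up as O-grave, O-tilde, O-acute.
sample = b"\xd2It\xd5s fine\xd3"
converted = to_utf8(sample).decode("utf-8")
```

After a conversion like this, the page also has to declare UTF-8 (in the HTTP header or a meta tag), otherwise the browser will mis-read the new bytes too.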

methodin
+1  A: 

Word is a mess! For individual files I run through something like this: http://word2cleanhtml.com/

If this is going to be an ongoing thing, there are entire libraries dedicated to de-word-ifying Word documents for the web. Try HTML Tidy or HTML Purifier.

If you're going to be dealing with a WYSIWYG-type tool and this is ongoing, CKEditor will automatically drop Word HTML garbage. The thing that differentiates CK from TinyMCE and others is that even if the user forgets to use "Paste from Word", it still will not allow the bad stuff through.

Since switching to CK and Tidy, I've not had a single problem on my company's site, despite it being used by hundreds of users with varying levels of web knowledge. Prior to the changes, it was a near-daily issue.

bpeterson76