I'm wondering how you clean up the special characters that MS Word adds, such as em and en dashes and curly quotes?

I often find myself copying clients' content from Word and pasting it into a static HTML page, but the content ends up with weird characters because the special characters are not converted to their correct ASCII codes and therefore show up as garbled text. (For these basic websites, I'm using Dreamweaver.)

I have seen a lot of similar problems when clients copy content from Word into text-only fields (mostly textareas). When I put this content into a PDF (through PHP), or when it shows up on the page, it has garbled text too.

How do you deal with this? Is there a cleaning service or program you use?

+1  A: 

Make sure you specify an encoding everywhere, and use UTF-8; then those "special" characters should survive just fine. But once they've gone through an encoding that can't represent them, the information about which character each one originally was is lost, so it can't be repaired (except for some specific, though probably very common, cases like a mix-up between Cp1252 and ISO-8859-1).
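To make the difference between repairable and unrepairable garbling concrete, here is a minimal sketch. It is written in Python purely for illustration (the thread itself concerns HTML and PHP), and the sample string and variable names are my own assumptions. It simulates UTF-8 bytes being misread as Cp1252, which can be reversed, and a pass through ASCII, which cannot.

```python
# Assumed sample text: a curly apostrophe (U+2019) and an en dash (U+2013),
# the kind of characters Word inserts automatically.
original = "Word\u2019s dash \u2013 here"

# Case 1: UTF-8 bytes are mistakenly decoded as Cp1252.
# Every byte still maps to some character, so nothing is lost yet.
garbled = original.encode("utf-8").decode("cp1252")

# Reversing the wrong step restores the text exactly.
repaired = garbled.encode("cp1252").decode("utf-8")

# Case 2: the text passes through an encoding that cannot represent
# the characters. They are replaced by "?" and cannot be recovered.
lossy = original.encode("ascii", errors="replace").decode("ascii")

print(repr(garbled))
print(repaired == original)
print(lossy)
```

The reversal in case 1 only works when every misread byte happens to be defined in Cp1252; a few byte values are not, so this is a best-effort repair rather than a general fix.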

Michael Borgwardt
For the HTML pages especially, everything is UTF8, so that's not the problem.
Darryl Hein
If the characters get garbled, NOT everything is UTF-8. Common culprits are a missing accept-charset attribute of forms, and certain web browsers that don't interpret it correctly.
Michael Borgwardt
Well, if any browser doesn't interpret it right, then I'd say it doesn't work. Here are my doctype etc: <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> Is there something wrong there?
Darryl Hein
XHTML itself is problematic, see http://www.dev-archive.net/articles/xhtml.html though I have not heard about problems specifically with encodings. As I wrote: do the forms have an accept-charset defined, and do you use a recent browser? What language/environment is used to process the form data? Does it use UTF-8 correctly?
Michael Borgwardt
Ah, I see you're using PHP. Well, there's your problem, most likely. Read here: http://www.phpwact.org/php/i18n/charsets And especially note the section about contradicting encodings in the HTTP headers and the page itself.
Michael Borgwardt
The charset is getting confused somewhere, but you'll need to track it through every stage to find the problem. Note, though, that a common problem occurs if you're copying into text fields and text areas, as browsers don't normally send the charset with POST submissions, and the HTTP default is ISO-8859-1 not UTF-8. You may need to tell the web server to expect UTF-8 in the submitted data.
Alohci
Yes, but the problem also happens on static HTML pages that are not using PHP. The page has the garbled text both on my computer and on the server.
Darryl Hein
In that case, you only need to ensure that you actually use UTF-8 to save the page, that UTF-8 is declared as the encoding in the HTTP header, the XML declaration or the META tag, and, in the latter two cases, that the server does not send a contradicting HTTP header.
Michael Borgwardt
+1  A: 

You might try the Demoroniser.
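For a rough idea of what this kind of cleanup involves, here is a minimal Python sketch. It is not the Demoroniser itself; the mapping and the function name `to_plain_ascii` are assumptions of mine, and the table covers only a handful of the characters Word commonly inserts.

```python
# Hypothetical, deliberately small mapping from Word's "smart" characters
# to plain ASCII stand-ins. Extend it as needed.
WORD_TO_ASCII = {
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote / apostrophe
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # ellipsis
    "\u00a0": " ",    # non-breaking space
}

# str.maketrans accepts single-character keys and string values.
_TABLE = str.maketrans(WORD_TO_ASCII)


def to_plain_ascii(text: str) -> str:
    """Replace the mapped characters; leave everything else untouched."""
    return text.translate(_TABLE)


print(to_plain_ascii("\u201cHi\u201d \u2014 it\u2019s done\u2026"))
```

Note that this only works on text that has been decoded correctly in the first place; it flattens the typography rather than repairing an encoding mix-up.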

Adrien
Dang, that is nice. If no one comes up with anything better, that might just work.
Darryl Hein
A: 

If it's a Word file that's just text (i.e., no graphics, tables, etc.), you might try saving it as HTML from within Word, copying and pasting the resulting HTML into your document in Dreamweaver, and then using Dreamweaver's "Clean Up Word HTML" function (under the Commands menu).

As an alternative, you can try fix my HTML, though I've not personally tried it with Word text, so results may vary.

Scottie
I'm trying to find something that doesn't take 5 steps to get into Dreamweaver and it'd also be nice to have something that I can give to clients to clean their Word content as well.
Darryl Hein
+1  A: 

With regard to clients posting text copied and pasted from Word into textareas:

The most reliable way to ensure that the client sends you text in any particular encoding (thus hopefully doing any conversion from CP-1252 [or whatever Word uses] for you), is to add the accept-charset="..." attribute to all your <form>s. E.g.:

<form ... accept-charset="UTF-8">
   ...
</form>

Most browsers will obey that and make sure any "Word-specific" characters are converted to the appropriate character set before the text gets to your website.

Once invalid text gets to your website, there's very little you can do to fix it reliably, so it's best to simply check all input for being valid in whatever character set you use, and discard any requests that have invalid text. This is necessary even with accept-charset, because undoubtedly there are some clients out there that will ignore it.
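The "check all input" step can be sketched as follows. This is a minimal Python illustration, assuming UTF-8 is the character set you chose and that the raw request bytes are available; the function name `is_valid_utf8` is hypothetical, and a PHP site would need the equivalent check in its own code.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the raw bytes decode cleanly as UTF-8."""
    try:
        raw.decode("utf-8")  # strict decoding raises on malformed input
    except UnicodeDecodeError:
        return False
    return True


# A curly apostrophe sent as UTF-8 passes the check.
print(is_valid_utf8("it\u2019s".encode("utf-8")))

# The same apostrophe sent as the single Cp1252 byte 0x92 does not,
# so the request would be rejected instead of stored as garbage.
print(is_valid_utf8(b"it\x92s"))
```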

chazomaticus