I'm working on a web site with a content management system that does a bad job of displaying any text containing non-ASCII characters. For example, right single-quotes show up garbled, as on the following web page (this is just an example, not from the CMS-driven web site):
http://www.gregboettcher.com/cmsunicode.html
I can't control the inner workings of the CMS, but still I'd like to try to fix this glitch somehow.
I tried messing around with the page's charset declaration, but changing it from UTF-8 to ANSI or UCS-2 only made things worse.
Here is my main question: Could JavaScript be used to somehow find badly encoded Unicode characters and make them display properly?
I'm grasping at straws here. Many thanks to anyone who can help.
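For what it's worth, here is a rough sketch of the kind of client-side repair I have in mind. It assumes the garbling comes from UTF-8 bytes being misread as Windows-1252 and then served back out as UTF-8 (a guess on my part); the fixMojibake helper and the CP1252 table are my own names for illustration, not anything the CMS provides:

    // Windows-1252 maps bytes 0x80-0x9F to these Unicode code points;
    // invert that mapping so we can recover the original byte values.
    var CP1252 = {
      0x20AC: 0x80, 0x201A: 0x82, 0x0192: 0x83, 0x201E: 0x84, 0x2026: 0x85,
      0x2020: 0x86, 0x2021: 0x87, 0x02C6: 0x88, 0x2030: 0x89, 0x0160: 0x8A,
      0x2039: 0x8B, 0x0152: 0x8C, 0x017D: 0x8E, 0x2018: 0x91, 0x2019: 0x92,
      0x201C: 0x93, 0x201D: 0x94, 0x2022: 0x95, 0x2013: 0x96, 0x2014: 0x97,
      0x02DC: 0x98, 0x2122: 0x99, 0x0161: 0x9A, 0x203A: 0x9B, 0x0153: 0x9C,
      0x017E: 0x9E, 0x0178: 0x9F
    };

    function fixMojibake(s) {
      // Rebuild the byte sequence the CMS presumably read from the
      // database, then decode those bytes as the UTF-8 they really were.
      var bytes = Uint8Array.from(s, function (ch) {
        var cp = ch.codePointAt(0);
        return CP1252[cp] !== undefined ? CP1252[cp] : (cp & 0xFF);
      });
      try {
        return new TextDecoder('utf-8', { fatal: true }).decode(bytes);
      } catch (e) {
        return s; // not valid UTF-8, so probably not mojibake; leave it alone
      }
    }

    // Walk every text node in the page and repair it in place.
    function repairTextNodes(root) {
      var walker = document.createTreeWalker(root, NodeFilter.SHOW_TEXT);
      var node;
      while ((node = walker.nextNode())) {
        node.nodeValue = fixMojibake(node.nodeValue);
      }
    }

    repairTextNodes(document.body);

Running repairTextNodes after the page loads (say, from a window.onload handler) would rewrite the visible text without touching the CMS itself, though any string that legitimately contains a sequence like â€™ would get clobbered too.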
Edited June 12:
To everybody who replied, thanks for giving some helpful responses to a pretty vague question.
I've looked into this some more, and it looks like the CMS writes UTF-8 to the database but then reads it back expecting some other encoding (perhaps a single-byte encoding like Windows-1252), even though the pages it produces still say "charset=UTF-8". In other words, the text ends up double-encoded.
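A quick way to reproduce what I think is happening (this is my reconstruction, not anything confirmed by the CMS vendor): take a right single quote, encode it as UTF-8, and decode those bytes as Windows-1252:

    // U+2019 (right single quote) becomes the UTF-8 bytes E2 80 99.
    var utf8Bytes = new TextEncoder().encode('\u2019');
    // Reading those bytes back as Windows-1252 yields three characters.
    var mojibake = new TextDecoder('windows-1252').decode(utf8Bytes);
    console.log(mojibake); // "â€™"

When the CMS then re-encodes that three-character string as UTF-8 for the page, the browser faithfully displays the garbage.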
I agree it would probably be best to try to fix this by preventing non-ASCII characters from being written to the database, but with the CMS I'm using, that's not very practical.
I told my supervisor we could still use JavaScript to patch the problem on the client side, but when I explained what that would involve, he told me not to bother. He seems content to understand the cause of the problem and forward the bug to the makers of the CMS.
So thanks -- I learned something about text encoding and JavaScript from this.