I'm working on a web site with a content management system that does a bad job of displaying any text with non-ASCII characters. For example, right single quotes show up garbled, as on the following web page (this is just an example, not from the CMS-driven web site):

http://www.gregboettcher.com/cmsunicode.html

I can't control the inner workings of the CMS, but still I'd like to try to fix this glitch somehow.

I tried messing around with the charset definition of the page, but changing it from UTF-8 to ANSI or UCS-2 just made it worse.

Here is my main question: Could JavaScript be used to somehow find badly encoded Unicode characters and make them display properly?

I'm grasping at straws here. Many thanks to anyone who can help.


Edited June 12:

To everybody who replied, thanks for giving some helpful responses to a pretty vague question.

I've looked into this some more, and it looks like the CMS is writing UTF-8 to the database, but then reading it from the database with the expectation of something other than UTF-8 (even though it then produces web pages that say "charset=UTF-8").
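
For illustration, here's a rough sketch of what I think is happening, using the TextEncoder/TextDecoder APIs just to demonstrate the mechanism (Windows-1252 is only my guess at what the CMS is decoding with):

// Encode a right single quote (U+2019) as UTF-8, then decode those bytes
// with a non-UTF-8 charset -- this reproduces the familiar "â€™" garbage.
var bytes = new TextEncoder().encode("\u2019");               // [0xE2, 0x80, 0x99]
var misread = new TextDecoder("windows-1252").decode(bytes);
console.log(misread);                                         // prints "â€™"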

I agree it would probably be best to try to fix this by preventing non-ASCII characters from being written to the database, but with the CMS I'm using, that's not very practical.

I told my supervisor we could still use JavaScript to fix the problem on the client side, but when I explained what it would involve, he told me not to bother. He seems content to understand what's causing the problem, and forward the bug on to the makers of the CMS.

So thanks -- I learned something about text encoding and JavaScript from this.

A: 
jcubic

Depending on the number of contributors using the CMS, honestly I think your safest and simplest bet may be to enumerate all the illegal characters and supply your own replacements. In my experience, the list is usually pretty small -- the four smart quotes, the em dash, the ellipsis, and the non-breaking space are usually the only culprits I see. Every company is a little different (some use the trademark, copyright, and registered symbols frequently, but those show up often enough that you only have to add them to your list once). Accents and diacriticals tend not to be a problem nowadays.

I suspect the problem is made slightly harder by the fact that the character encodings for these symbols seem to be tied to the font the user opts to use -- which is the only way I can explain two users sitting side by side on identically configured machines producing different extended characters. So do a search through your site text for any extended characters (a rough way to do that is sketched below), and add them by hand to a JavaScript file you've saved in UTF-8.
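
Something along these lines, run from the browser console on an affected page, should list the distinct non-ASCII characters it contains (just a sketch -- adjust it to whatever element holds your CMS content):

// Collect the page text and pull out anything outside the ASCII range
var bodyText = document.body.innerText || document.body.textContent;
var extended = bodyText.match(/[^\x00-\x7F]/g) || [];
// De-duplicate so each problem character is listed only once
var unique = {};
for (var i = 0; i < extended.length; i++) { unique[extended[i]] = true; }
console.log(Object.keys(unique));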

Sample code might look like:

// Text to clean up, containing smart quotes, an ellipsis, a dash, and a copyright sign
var strProblemText = "“I’d say, ‘Get’em all…” – Pokemon Master©";
// The problem characters, split into an array of single characters
var arrExtendedChars = "“”‘’…–©".split('');
// The replacements, in the same order as the characters above
var arrReplacements = ['"', '"', "'", "'", '...', '-', '©'];
for (var i = 0; i < arrExtendedChars.length; i++) {
    // Replace every occurrence of the i-th problem character with its substitute
    strProblemText = strProblemText.replace(new RegExp(arrExtendedChars[i], "ig"), arrReplacements[i]);
}
alert(strProblemText);

The split() call is a bit of a headache to look at, but it splits the string into an array of single characters and lets you keep all your problem characters together on one line. I just find it easier to maintain. Others may slightly disagree. Still others may think I'm insane.

As mentioned by @Pointy, it is best to do this when the text is going into the database, or at least prior to it being sent to the user's page, but doing it after the text has been sent and loaded is still a viable option.
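
If you do end up applying it client-side after load, a rough sketch might look like the following (it reuses arrExtendedChars and arrReplacements from above; walkTextNodes is just a helper name I made up):

// Visit every text node under a root element and apply a callback to it
function walkTextNodes(node, fn) {
    if (node.nodeType === 3) {            // 3 == TEXT_NODE
        fn(node);
    } else {
        for (var i = 0; i < node.childNodes.length; i++) {
            walkTextNodes(node.childNodes[i], fn);
        }
    }
}

window.onload = function () {
    walkTextNodes(document.body, function (textNode) {
        var s = textNode.nodeValue;
        // Swap each problem character for its plain replacement
        for (var i = 0; i < arrExtendedChars.length; i++) {
            s = s.replace(new RegExp(arrExtendedChars[i], "g"), arrReplacements[i]);
        }
        textNode.nodeValue = s;
    });
};

Working on text nodes rather than innerHTML keeps the markup intact and only touches the visible text.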

Andrew