I'm reading an HTML document that contains UTF-8 chars but when I access the innerHTML of the document, all the "bad" chars show up as 0xfffd. I've tried it in all the major browsers and it behaves the same way. When I alert() the innerHTML it shows those chars as a "diamond with a ? mark".

Surprisingly, the following works perfectly and correctly displays the UTF-8 char in the alert box, so it's not that alert() is malfunctioning.

alert("Doppelg\u00e4nger!");

Why can't I access the UTF-8 chars using innerHTML? Or is there another way to access them in JavaScript?

A: 

Is the page sent with a UTF-8 charset? .innerHTML has never given me any trouble with UTF-8.

Greg
And just how do you debug it? I'm reading the innerHTML from within a frame, if that causes any trouble.
Jenko
You can either look at the headers or the page properties - what browser are you using?
Greg
+2  A: 

First, check whether the document head contains:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

You can also read out the meta tags with JavaScript:

var metaTags = document.getElementsByTagName("META");
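As a rough sketch of how that collection could be inspected, the helper below looks for a declared charset in either the http-equiv form or the shorter <meta charset="..."> form. The name findDeclaredCharset is made up for this illustration, and it only reports what the markup declares, which is not necessarily the encoding the browser actually used:

```javascript
// Hypothetical helper (not part of any library): returns the charset
// declared in a <meta> tag of the given document, or null if none is found.
function findDeclaredCharset(doc) {
  var metaTags = doc.getElementsByTagName("meta");
  for (var i = 0; i < metaTags.length; i++) {
    var tag = metaTags[i];

    // Short form: <meta charset="UTF-8">
    var direct = tag.getAttribute("charset");
    if (direct) {
      return direct.trim().toLowerCase();
    }

    // Long form: <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    var httpEquiv = tag.getAttribute("http-equiv") || "";
    var content = tag.getAttribute("content") || "";
    var match = /charset\s*=\s*["']?([^\s;"']+)/i.exec(content);
    if (httpEquiv.toLowerCase() === "content-type" && match) {
      return match[1].toLowerCase();
    }
  }
  return null;
}
```

In a browser this would be called as findDeclaredCharset(document), or with the frame's contentDocument when reading from a frame. Comparing the result with document.characterSet, which reports the encoding the browser settled on, can help spot a mismatch.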

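If it does, but the file itself is saved in a different encoding (for example ISO-8859-1), that explains the behavior: bytes that are not valid UTF-8 are decoded as U+FFFD. You can try changing utf-8 to ISO-8859-1: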

<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
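To illustrate why a mismatch between the declared and the actual encoding would produce 0xfffd, here is a small sketch that decodes the bytes of "Doppelgänger" as UTF-8 twice: once from ISO-8859-1 bytes and once from real UTF-8 bytes. The byte arrays and the decodeAsUtf8 name are written out by hand for this example; it assumes an environment that provides the standard TextDecoder:

```javascript
// "Doppelgänger" as it would be stored in an ISO-8859-1 file: ä is the single byte 0xE4.
var latin1Bytes = Uint8Array.from(
  [0x44, 0x6f, 0x70, 0x70, 0x65, 0x6c, 0x67, 0xe4, 0x6e, 0x67, 0x65, 0x72]
);

// The same word stored as UTF-8: ä is the two-byte sequence 0xC3 0xA4.
var utf8Bytes = Uint8Array.from(
  [0x44, 0x6f, 0x70, 0x70, 0x65, 0x6c, 0x67, 0xc3, 0xa4, 0x6e, 0x67, 0x65, 0x72]
);

// Hypothetical helper: decode a byte array the way a page declared as UTF-8 would be decoded.
function decodeAsUtf8(bytes) {
  return new TextDecoder("utf-8").decode(bytes);
}

var wrong = decodeAsUtf8(latin1Bytes); // 0xE4 is not valid UTF-8 here, so it becomes U+FFFD
var right = decodeAsUtf8(utf8Bytes);   // decodes cleanly to "Doppelgänger"

console.log(wrong.charCodeAt(7).toString(16)); // fffd
console.log(right === "Doppelg\u00e4nger");    // true
```

The replacement happens while the bytes are decoded, before any script runs, which is consistent with innerHTML already containing 0xfffd.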

A better option is to HTML-encode all extended characters in your HTML, like this:

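function encodeHTML(str) {
  var aStr = str.split(''),
      i = aStr.length,
      aRet = [];

  // Walk backwards over every character, including the one at index 0.
  while (i--) {
    var iC = aStr[i].charCodeAt(0);
    // Encode everything outside A-Z (65-90) and a-z (97-122).
    if (iC < 65 || iC > 122 || (iC > 90 && iC < 97)) {
      aRet.push('&#' + iC + ';');
    } else {
      aRet.push(aStr[i]);
    }
  }
  // The characters were collected in reverse order, so flip them back.
  return aRet.reverse().join('');
}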

Mind you, this function encodes everything that is not [a-zA-Z]. For example, it turns Doppelgänger into Doppelg&#228;nger.
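If encoding digits, spaces and markup is more than you want, a narrower variant might encode only the characters outside the ASCII range. The sketch below is one way that could look; encodeNonAscii is a name invented for this example, and it assumes an engine with for...of and codePointAt so that characters outside the Basic Multilingual Plane become a single numeric reference:

```javascript
// Hypothetical variant: replaces only non-ASCII characters with numeric
// character references and leaves ASCII (letters, digits, markup) untouched.
function encodeNonAscii(str) {
  var out = [];
  // for...of iterates by code point, so surrogate pairs stay together.
  for (var ch of str) {
    var code = ch.codePointAt(0);
    if (code > 127) {
      out.push('&#' + code + ';');
    } else {
      out.push(ch);
    }
  }
  return out.join('');
}

console.log(encodeNonAscii("<b>Doppelg\u00e4nger!</b>")); // <b>Doppelg&#228;nger!</b>
```

Because the output is pure ASCII, it reads the same whether the page is served as UTF-8 or ISO-8859-1.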

KooiInc
Pretty cool. Anyways I found that the problem was with the HTML page itself.
Jenko