I'm developing a Firefox plugin and I fetch web pages to do some analysis for the user. The problem is that when I try to get (via XMLHttpRequest) pages that are not UTF-8 encoded, the string I see is mangled. For example, Hebrew pages in windows-1255 or Chinese pages in gb2312.

I already tried the following:

var uDecoder = Components.classes["@mozilla.org/intl/scriptableunicodeconverter"]
    .getService(Components.interfaces.nsIScriptableUnicodeConverter);
uDecoder.charset = "windows-1255";
alert(xhr.responseText);

var decoder = Components.classes["@mozilla.org/intl/utf8converterservice;1"]
    .getService(Components.interfaces.nsIUTF8ConverterService);

alert(decoder.convertStringToUTF8(xhr.responseText, "WINDOWS-1255", true));

I also tried escape/unescape/encodeURIComponent.

Any ideas?

+2  A: 

Once XMLHttpRequest has tried to decode a non-UTF-8 string using UTF-8, you've already lost. The byte sequences in the page that weren't valid UTF-8 sequences will have been mangled (typically converted to '�', the U+FFFD replacement character). No amount of re-encoding/decoding will get them back.

Pages that specify a Content-Type: text/html;charset=something HTTP header should be OK. Pages that don't have a real HTTP header but do have a <meta> version of it won't be, because XMLHttpRequest doesn't know about parsing HTML so it won't see the meta. If you know in advance the charset you want, you can tell XMLHttpRequest and it'll use it:

xhr.open(...);
xhr.overrideMimeType('text/html;charset=gb2312');
xhr.send();

(This is a currently non-standardised Mozilla extension.)

If you don't know the charset in advance, you can request the page once, scan the response for a <meta> charset declaration, parse that out, and request the page again with the charset you found, as sketched below.
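
A minimal sketch of that two-pass approach, assuming the page really does declare its charset in a <meta> tag; the function name fetchWithSniffedCharset, the callback parameter, and the regex are illustrative simplifications, not a real HTML parser:

function fetchWithSniffedCharset(url, callback) {
    // First pass: fetch with whatever charset XMLHttpRequest guesses,
    // just to sniff a <meta> charset out of the markup.
    var probe = new XMLHttpRequest();
    probe.open('GET', url);
    probe.onload = function () {
        var m = /<meta[^>]+charset=["']?([\w-]+)/i.exec(probe.responseText);
        var charset = m ? m[1] : 'utf-8';

        // Second pass: refetch and force decoding with the sniffed charset.
        var xhr = new XMLHttpRequest();
        xhr.open('GET', url);
        xhr.overrideMimeType('text/html;charset=' + charset);
        xhr.onload = function () { callback(xhr.responseText); };
        xhr.send();
    };
    probe.send();
}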

In theory you could get a binary response in a single request:

xhr.overrideMimeType('text/html;charset=iso-8859-1');

and then convert that from bytes-as-chars to UTF-8. However, iso-8859-1 wouldn't work for this because the browser interprets that charset as really being Windows code page 1252.

You could maybe use another codepage that maps every byte to a character, and do a load of tedious character replacements to map every character in that codepage to the character it would have been in real-ISO-8859-1, then do the conversion. Most encodings don't map every byte, but Arabic (cp1256) might be a candidate for this?
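
As a related single-request variant (not the codepage-remapping trick just described, but a commonly used Gecko alternative), you can override to the pseudo-charset x-user-defined, which keeps each original byte recoverable from the low 8 bits of the corresponding character in responseText, and then hand the byte array to nsIScriptableUnicodeConverter. A rough sketch; the function name fetchAs and its callback parameter are illustrative only:

function fetchAs(url, realCharset, callback) {
    var xhr = new XMLHttpRequest();
    xhr.open('GET', url);
    // x-user-defined maps each byte to a character whose low 8 bits
    // are the original byte value, so nothing gets mangled.
    xhr.overrideMimeType('text/plain; charset=x-user-defined');
    xhr.onload = function () {
        var text = xhr.responseText, bytes = [];
        for (var i = 0; i < text.length; i++) {
            bytes.push(text.charCodeAt(i) & 0xFF);   // recover the raw byte
        }
        // Decode the raw bytes with the charset we actually wanted.
        var conv = Components.classes["@mozilla.org/intl/scriptableunicodeconverter"]
            .createInstance(Components.interfaces.nsIScriptableUnicodeConverter);
        conv.charset = realCharset;
        callback(conv.convertFromByteArray(bytes, bytes.length));
    };
    xhr.send();
}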

bobince
thanks - working... I guess you are right, I will have to fetch non-UTF-8 pages twice...
Amir
Added some ideas on avoiding the second fetch... not very nice ideas, but potentially workable.
bobince