If you look at this table here, it has a list of escape sequences for Unicode characters that don't actually work for me.

For example, for "%96", which should be a –, I get an error when trying to decode:

decodeURIComponent("%96");
URIError: URI malformed

If I attempt to encode "–" I actually get:

encodeURIComponent("–");
"%E2%80%93"

I searched through the internet and I saw this page, which mentions using escape and unescape with decodeURIComponent and encodeURIComponent respectively. This doesn't seem to help because %96 doesn't show up as "–" no matter what I try and this of course wouldn't work:

decodeURIComponent(escape("%96"));
"%96"

Not very helpful.

How can I get "%96" to be a "–" with JavaScript (without hardcoding a map for every single possible Unicode character I may run into)?

+1  A: 

See this question, specifically this answer:

there is a special “%uNNNN” format for encoding Unicode UTF-16 code points, instead of encoding UTF-8 bytes

I suspect "–" is one of those characters, since 0x96 in the extended ASCII table (code page 437) is û.

Josh
escape("–") creates "%u2013". This still doesn't explain how I can handle %96 when I encounter %96. I'm not encoding or escaping, I'm trying to decode! :(
apphacker
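To make the difference concrete, here is a small sketch of what the legacy escape/unescape pair does with these inputs. It only uses the built-in globals; the variable names are just for illustration. The point is that unescape reads "%96" as the single code unit 0x96 (a C1 control character), so it does not produce an en dash either.

```javascript
// escape() writes code units above 0xFF in the %uNNNN form.
var escaped = escape("\u2013");            // "\u2013" is the en dash
console.log(escaped);                      // "%u2013"

// unescape() maps %nn straight to the code unit 0xnn (Latin-1 style),
// so %96 becomes U+0096, a control character, not U+2013.
var controlChar = unescape("%96");
console.log(controlChar.charCodeAt(0).toString(16)); // "96"

// The %uNNNN form does round-trip back to the en dash.
console.log(unescape("%u2013") === "\u2013");        // true
```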
+1  A: 

Posting as a community wiki entry as it's from "Building Scalable Websites" by Cal Henderson. The book says it's OK to reproduce significant portions of the examples though. You may be able to create a special case for "-" with it.

function escape_utf8(data) {
    if (data == '' || data == null) {
        return '';
    }
    data = data.toString();
    var buffer = '';
    for (var i = 0; i < data.length; i++) {
        // codePointAt reads a whole code point, so characters
        // above U+FFFF can actually reach the 4-byte branch
        var c = data.codePointAt(i);
        var bs = [];
        if (c >= 0x10000) {
            // 4 bytes; skip the low surrogate of the pair
            i++;
            bs[0] = 0xF0 | ((c & 0x1C0000) >>> 18);
            bs[1] = 0x80 | ((c & 0x3F000) >>> 12);
            bs[2] = 0x80 | ((c & 0xFC0) >>> 6);
            bs[3] = 0x80 | (c & 0x3F);
        } else if (c >= 0x800) {
            // 3 bytes
            bs[0] = 0xE0 | ((c & 0xF000) >>> 12);
            bs[1] = 0x80 | ((c & 0xFC0) >>> 6);
            bs[2] = 0x80 | (c & 0x3F);
        } else if (c >= 0x80) {
            // 2 bytes
            bs[0] = 0xC0 | ((c & 0x7C0) >>> 6);
            bs[1] = 0x80 | (c & 0x3F);
        } else {
            // 1 byte
            bs[0] = c;
        }
        // emit every byte as %XX
        for (var j = 0; j < bs.length; j++) {
            var b = bs[j];
            var hex = nibble_to_hex((b & 0xF0) >>> 4)
                + nibble_to_hex(b & 0x0F);
            buffer += '%' + hex;
        }
    }
    return buffer;
}
function nibble_to_hex(nibble) {
    var chars = '0123456789ABCDEF';
    return chars.charAt(nibble);
}
David Morrissey
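As a quick sanity check, the three-byte branch of the function above can be worked through by hand for the en dash (U+2013). This is only an illustrative sketch; the variable names are made up here, and the result is compared against the built-in encodeURIComponent.

```javascript
// Code unit of the en dash: 0x2013, which falls in the 3-byte range.
var c = "\u2013".charCodeAt(0);

// Same bit manipulation as the 3-byte branch.
var bytes = [
    0xE0 | ((c & 0xF000) >>> 12), // 0xE2: top 4 bits
    0x80 | ((c & 0xFC0) >>> 6),   // 0x80: middle 6 bits
    0x80 | (c & 0x3F)             // 0x93: low 6 bits
];

// Every byte here is >= 0x80, so toString(16) always yields two digits.
var encoded = bytes.map(function (b) {
    return '%' + b.toString(16).toUpperCase();
}).join('');

console.log(encoded);                                  // "%E2%80%93"
console.log(encoded === encodeURIComponent("\u2013")); // true
```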
Not sure how this helps, since I am trying to convert the %nn format to Unicode, not the other way around.
apphacker
I suppose you could replace "%96" with "%E2%80%93" before decoding it in JavaScript with decodeURIComponent, but it may have side effects so I don't know.
David Morrissey
Unfortunately I don't have control of the escape process, it's a string in the wild. I just need to be able to handle the "%nn" format in all cases.
apphacker
If the codes are single "%nn" codes, are you sure the source data is encoded in Unicode and not ASCII/Latin-1 etc.? decodeURIComponent and encodeURIComponent always work in UTF-8, whatever the page's encoding, so a lone "%96" is malformed to them. It might be a hack, but as a last resort you could have another IFrame with a different encoding communicating with the parent, or something similar. Or you could use a table from the other encoding to Unicode, split the string on the '%' escapes, and decode it yourself. Just some suggestions :-)
David Morrissey
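The table idea from the last comment could look roughly like this. It is a minimal sketch that assumes the stray bytes are Windows-1252, where 0x96 is the en dash; that is a guess about the source data, not something the thread confirms. The names CP1252_HIGH, decodeCp1252Component and decodeLenient are hypothetical helpers, not built-ins.

```javascript
// Assumption: input bytes are Windows-1252. Only 0x80..0x9F differ from
// Latin-1; every other byte maps to the code point of the same value.
// Unassigned bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) keep their C1 control value.
var CP1252_HIGH = [
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
];

// Hypothetical helper: decode each %nn escape as one Windows-1252 byte.
function decodeCp1252Component(s) {
    return s.replace(/%([0-9A-Fa-f]{2})/g, function (match, hex) {
        var b = parseInt(hex, 16);
        var cp = (b >= 0x80 && b <= 0x9F) ? CP1252_HIGH[b - 0x80] : b;
        return String.fromCharCode(cp);
    });
}

// Hypothetical wrapper: try UTF-8 first, fall back only when
// decodeURIComponent rejects the input as malformed.
function decodeLenient(s) {
    try {
        return decodeURIComponent(s);
    } catch (e) {
        return decodeCp1252Component(s);
    }
}

console.log(decodeLenient("%96") === "\u2013");       // true
console.log(decodeLenient("%E2%80%93") === "\u2013"); // true
```

The fallback works on the whole string, so input that mixes valid UTF-8 escapes with stray Windows-1252 bytes would be decoded entirely as Windows-1252. Strings in the wild may need a smarter rule than this.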