ansaurus

Question

Answer 1

+4 A:

You need to add on 0x10000 when decoding the surrogate pair; to quote rfc 2781, the step you're missing is number 5:

    1) If W1 < 0xD800 or W1 > 0xDFFF, the character value U is the value
       of W1. Terminate.

    2) Determine if W1 is between 0xD800 and 0xDBFF. If not, the sequence
       is in error and no valid character can be obtained using W1.
       Terminate.

    3) If there is no W2 (that is, the sequence ends with W1), or if W2
       is not between 0xDC00 and 0xDFFF, the sequence is in error.
       Terminate.

    4) Construct a 20-bit unsigned integer U', taking the 10 low-order
       bits of W1 as its 10 high-order bits and the 10 low-order bits of
       W2 as its 10 low-order bits.

    5) Add 0x10000 to U' to obtain the character value U. Terminate.

ie. one fix would be to add an extra line after your first read:

cur = (old.data[i] & 0x3ff) << 10;
cur += 0x10000;

JosephH 2010-09-24 13:18:03

Wow, thanks! A simple missing step added in, and my UTF-16 decoder works!

Delan Azabani 2010-09-24 13:20:28

No problem, glad to hear it works now. Thanks for fixing my typo :)

JosephH 2010-09-24 13:28:33

Answer 2

A:

You seem to be missing an offset of 0x10000.

According to this WIKI page, UTF-16 surrogate pairs are constructed like this:

UTF-16 represents non-BMP characters (U+10000 through U+10FFFF) using two code units, known as a surrogate pair. First 1000016 is subtracted from the code point to give a 20-bit value. This is then split into two 10-bit values each of which is represented as a surrogate with the most significant half placed in the first surrogate.

Bart van Ingen Schenau 2010-09-24 13:43:51

ansaurus

tags:

views:

answers:

UTF-16 decoder not working as expected

related questions