Can UTF-8 use 5- or 6-byte sequences, so that every Unicode character can be encoded? The standards I've read seem to conflict. I need to be able to support every Unicode character, not just those in the U+0000..U+10FFFF range.

(All quotes are from RFC 3629)

Section 3:

In UTF-8, characters from the U+0000..U+10FFFF range (the UTF-16 accessible range) are encoded using sequences of 1 to 4 octets. The only octet of a "sequence" of one has the higher-order bit set to 0, the remaining 7 bits being used to encode the character number. In a sequence of n octets, n>1, the initial octet has the n higher-order bits set to 1, followed by a bit set to 0. The remaining bit(s) of that octet contain bits from the number of the character to be encoded. The following octet(s) all have the higher-order bit set to 1 and the following bit set to 0, leaving 6 bits in each to contain bits from the character to be encoded.
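To make that bit layout concrete, here is a minimal Python sketch of the 1- to 4-octet forms as the paragraph describes them. The function name `utf8_encode` is made up for illustration and is not taken from the RFC; the range and surrogate checks reflect RFC 3629's stated limits.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point using the RFC 3629 bit layout (illustrative sketch)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point outside U+0000..U+10FFFF")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points are not encoded in UTF-8")
    if cp < 0x80:
        return bytes([cp])                          # 0xxxxxxx
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6),             # 110xxxxx
                      0x80 | (cp & 0x3F)])          # 10xxxxxx
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),            # 1110xxxx
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),                # 11110xxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])
```

With this layout the largest code point, U+10FFFF, gets the lead octet F4, which is the highest lead octet a 4-octet sequence needs.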

So not all possible characters can be encoded with UTF-8? Does this mean I cannot encode characters from different planes than the BMP?

Section 2:

The octet values C0, C1, F5 to FF never appear.
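One way to see where that list comes from is to classify each possible lead octet. The sketch below is only an illustration (the helper name `sequence_length` is hypothetical): C0 and C1 could only start overlong 2-octet forms of U+0000..U+007F, F5 to F7 would start values above U+10FFFF, and F8 to FD were the lead octets of the old 5- and 6-octet forms.

```python
from typing import Optional

def sequence_length(lead: int) -> Optional[int]:
    """Return the sequence length a lead octet announces, or None if the
    octet can never start a sequence under RFC 3629."""
    if lead <= 0x7F:
        return 1      # 0xxxxxxx: single-octet (ASCII) form
    if 0xC2 <= lead <= 0xDF:
        return 2      # 110xxxxx, minus C0/C1 (overlong forms only)
    if 0xE0 <= lead <= 0xEF:
        return 3      # 1110xxxx
    if 0xF0 <= lead <= 0xF4:
        return 4      # 11110xxx, capped at F4 (U+10FFFF)
    return None       # 80..BF (continuation), C0, C1, F5..FF

# Octets that are neither valid lead octets nor continuation octets.
never_appear = [b for b in range(0x100)
                if sequence_length(b) is None and not 0x80 <= b <= 0xBF]
```

Here `never_appear` works out to exactly C0, C1 and F5 through FF.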

Does this mean we cannot encode values that need 5 or 6 octets (or even some 4-octet values that fall outside the above range)?

Section 12:

Restricted the range of characters to 0000-10FFFF (the UTF-16 accessible range).

Looking at the previous RFC confirms this: they reduced the range of characters.

Section 10:

Another security issue occurs when encoding to UTF-8: the ISO/IEC 10646 description of UTF-8 allows encoding character numbers up to U+7FFFFFFF, yielding sequences of up to 6 bytes. There is therefore a risk of buffer overflow if the range of character numbers is not explicitly limited to U+10FFFF or if buffer sizing doesn't take into account the possibility of 5- and 6-byte sequences.

So these sequences are allowed per the ISO/IEC 10646 definition, but not the RFC 3629 definition? Which one should I follow?

Thanks in advance.

+4  A: 

There are no Unicode characters beyond U+10FFFF; the BMP covers only U+0000 through U+FFFF.

UTF-8 is well-defined for the whole range U+0000..U+10FFFF.

devio
Thanks, that makes sense. Does this mean I only need to handle UTF-8 sequences of up to 4 octets, and can treat anything longer as an error?
Patrick Niedzielski
A: 

Both UTF-8 and UTF-16 allow all Unicode characters to be encoded. What UTF-8 is not allowed to do is encode the high and low surrogate code points (U+D800..U+DFFF, which UTF-16 uses in pairs) or values above U+10FFFF, which aren't legal Unicode.

Note that the BMP ends at U+FFFF.
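A quick check with Python 3's built-in codec illustrates these three cases. This is just a demonstration of how one implementation behaves, not a statement about every library; the variable names are arbitrary.

```python
# A character outside the BMP encodes to four octets.
grinning_face = "\U0001F600"
encoded = grinning_face.encode("utf-8")    # F0 9F 98 80

# A lone surrogate is not a character; the strict codec refuses to encode it.
try:
    "\ud800".encode("utf-8")
    surrogate_rejected = False
except UnicodeEncodeError:
    surrogate_rejected = True

# Python's str type cannot hold a value above U+10FFFF at all.
try:
    chr(0x110000)
    out_of_range_rejected = False
except ValueError:
    out_of_range_rejected = True
```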

chryss
A: 

I would have to say no: Unicode code points are valid for the range [0, 0x10FFFF], and those map to 1-4 octets. So, if you did come across a 5- or 6-octet UTF-8 encoded code point, it's not a valid code point - there's certainly nothing assigned there. I am a little baffled as to why they're there in the ISO standard - I couldn't find an explanation.
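On the decoding side, a strict decoder should reject such input. As a small sketch using Python 3's built-in decoder (the helper `is_valid_utf8` and the sample byte strings are my own, chosen for illustration):

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes under Python's strict UTF-8 codec."""
    try:
        data.decode("utf-8")    # error handling is 'strict' by default
        return True
    except UnicodeDecodeError:
        return False

four_octets = b"\xf4\x8f\xbf\xbf"        # U+10FFFF, the last code point
five_octets = b"\xf8\x88\x80\x80\x80"    # old 5-octet form of the value 0x200000
beyond_range = b"\xf4\x90\x80\x80"       # 4 octets, but would be 0x110000
```

Only `four_octets` passes `is_valid_utf8`; the other two are rejected even though they follow the original bit pattern.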

It does make you wonder, however, if perhaps someday in the future, they would expand past U+10FFFF. 0x10FFFF allows for over a million characters, but there are a lot of characters out there, and it would depend on how much eventually gets encoded. (For sanity's sake, let's hope not; a million characters is a lot!) UTF-32 could handle more code points, and as you've discovered, UTF-8 could. It'd really be UTF-16 that's out of luck: more surrogate pairs would be needed somewhere in the spectrum of code points.
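The UTF-16 limit falls straight out of the surrogate arithmetic. A rough sketch (the function name `to_surrogate_pair` is hypothetical): 0x400 high surrogates times 0x400 low surrogates give 0x100000 pairs, which start at U+10000 and therefore end at U+10FFFF.

```python
def to_surrogate_pair(cp: int):
    """Split a supplementary code point into a UTF-16 (high, low) pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = cp - 0x10000                 # fits in 20 bits
    high = 0xD800 + (offset >> 10)        # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits
    return high, low

pair_count = 0x400 * 0x400                # 1,048,576 possible pairs
last_reachable = 0x10000 + pair_count - 1 # highest code point UTF-16 can reach
```

Since `last_reachable` is 0x10FFFF, going any higher would need new surrogate ranges carved out of already-allocated space.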

Thanatos
The ISO had originally intended to introduce their own 31-bit character encoding. UTF-8 was designed around that possibility.
dan04
To me, it seems Unicode is trying to fill up the rest of the code points, as if they have more than they know what to do with. Example: there is a block for Mahjong tiles. However, there certainly are some useful characters outside the BMP that I need to support. Most of them are rubbish, though. It makes me wonder why they didn't accept Klingon characters a while back.
Patrick Niedzielski