UTF-8 requires 4 bytes to represent characters outside the BMP. That's not bad; it's no worse than UTF-16 or UTF-32. But it's not optimal (in terms of storage space).

There are 13 byte values (C0-C1 and F5-FF) that are never used, and there are multi-byte sequences that are never used, such as those corresponding to "overlong" encodings. If these had been available to encode characters, more of them could have been represented by 2-byte or 3-byte sequences (at the expense, of course, of a more complex implementation).
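For instance, C0 80 would be an overlong 2-byte encoding of U+0000, which strict decoders must reject (a minimal Python illustration):

    try:
        b"\xc0\x80".decode("utf-8")  # overlong encoding of U+0000
    except UnicodeDecodeError as e:
        print(e)  # 'utf-8' codec can't decode byte 0xc0 ...: invalid start byte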

Would it be possible to represent all 1,114,112 Unicode code points by a UTF-8-like encoding with at most 3 bytes per character? If not, what is the maximum number of characters such an encoding could represent?

By "UTF-8-like", I mean, at minimum:

  • The bytes 0x00-0x7F are reserved for ASCII characters.
  • Byte-oriented find / index functions work correctly. You can't get a false positive by starting in the middle of a character, the way you can in Shift-JIS (see the example below).
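To illustrate the Shift-JIS problem (a small Python sketch; UTF-8 avoids it because trailing bytes are always in the range 0x80-0xBF):

    # In Shift-JIS, the katakana "ソ" encodes as 0x83 0x5C, and 0x5C is
    # the ASCII backslash, so a naive byte search finds a false match.
    sjis = "ソ".encode("shift_jis")
    print(sjis)           # b'\x83\\'
    print(b"\\" in sjis)  # True: a bogus backslash inside one character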

Update -- My first attempt to answer the question

Suppose you have a UTF-8-style classification of leading/trailing bytes. Let:

  • A = the number of single-byte characters
  • B = the number of values used for leading bytes of 2-byte characters
  • C = the number of values used for leading bytes of 3-byte characters
  • T = 256 - (A + B + C) = the number of values used for trailing bytes

Then the number of characters that can be supported is N = A + BT + CT².

Given A = 128, the optimum is at B = 0 and C = 43. This allows 310,803 characters, or about 28% of the Unicode code space.
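A brute-force check of this optimum (a quick Python sketch of the formula above):

    # Maximize N = A + B*T + C*T^2 with A = 128 and T = 256 - (A + B + C).
    A = 128
    best = max(
        (A + B * T + C * T * T, B, C)
        for B in range(128)
        for C in range(128 - B)
        for T in (256 - A - B - C,)
    )
    print(best)  # (310803, 0, 43): N = 310,803 at B = 0, C = 43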

Is there a different approach that could encode more characters?

+4  A: 

It would take a little over 20 bits to record all the Unicode code points (assuming your number is correct), leaving over 3 bits out of 24 for encoding which byte is which. That should be adequate.

I fail to see what you would gain by this, compared to what you would lose by not going with an established standard.

Edit: Reading the spec again, you want the values 0x00 through 0x7f reserved for the first 128 code points. That means you only have 21 bits in 3 bytes to encode the remaining 1,113,984 code points. 21 bits is barely enough, but it doesn't really give you enough extra to do the encoding unambiguously. Or at least I haven't figured out a way, so I'm changing my answer.
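For the record, the bit count checks out (Python):

    import math
    # 1,114,112 - 128 = 1,113,984 code points must fit in the multi-byte forms.
    print(math.ceil(math.log2(1_113_984)))  # 21 bits: barely fits in 3 bytes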

As to your motivations, there's certainly nothing wrong with being curious and engaging in a little thought exercise. But the point of a thought exercise is to do it yourself, not try to get the entire internet to do it for you! At least be up front about it when asking your question.

Mark Ransom
+100 if I could. There is a reason we create standards.
Ed Swangren
Oh, I don't think the world needs another character encoding. Just curious.
dan04
+1  A: 

Sure it's possible. Proof:

2^24 = 16,777,216

So there is enough bit-space for 1,114,112 characters, but the more crowded the bit-space, the more bits are used per character. The whole point of UTF-8 is that it assumes the lower code points are far more likely in a character stream, so the entire thing will be quite efficient even though some characters use 4 bytes.

Assume 0-127 remains one byte. That leaves 8.4M spaces for 1.1M characters. You can then set this up as an equation. Choose an encoding scheme where the first byte determines how many bytes are used. So there are 128 remaining first-byte values, each of which will represent either 256 characters (2 bytes total) or 65,536 characters (3 bytes total). So:

256x + 65536(128-x) = 1114112 - 128

Solving this gives x ≈ 111.4, so you need 111 values of the first byte as 2-byte leads and the remaining 17 as 3-byte leads. To check:

128 + 111 * 256 + 17 * 65536 = 1,142,656 ≥ 1,114,112
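A quick Python sanity check of these numbers (trailing bytes are left unconstrained here, which is what the comment below objects to):

    # Capacity when x first-byte values lead 2-byte sequences and the
    # other (128 - x) lead 3-byte sequences.
    def capacity(x: int) -> int:
        return 128 + 256 * x + 65536 * (128 - x)

    print(capacity(111))  # 1142656 -- covers all 1,114,112 code points
    print(capacity(112))  # 1077376 -- one more 2-byte lead is too few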

To put it another way:

  • 128 code points require 1 byte;
  • 28,416 code points require 2 bytes; and
  • 1,114,112 code points require 3 bytes.

Of course, this doesn't allow for the inevitable expansion of Unicode, which UTF-8 does. You can adjust this to the first byte meaning:

  • 0-127 (128) = 1 byte;
  • 128-191 (64) = 2 bytes;
  • 192-255 (64) = 3 bytes.

This would be better because the sequence length can be determined with simple bitwise AND tests (see the sketch below), and it gives an address space of 4,210,816 code points.
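A minimal sketch of that length test in Python (this is the hypothetical scheme above, not real UTF-8):

    def seq_len(lead: int) -> int:
        # 0-127 (0xxxxxxx): 1 byte; 128-191 (10xxxxxx): 2 bytes;
        # 192-255 (11xxxxxx): 3 bytes -- per the table above.
        if lead & 0x80 == 0:
            return 1
        if lead & 0x40 == 0:
            return 2
        return 3

    # Address space: 128 + 64 * 256 + 64 * 65536 = 4,210,816 code points.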

cletus
But if you allow arbitrary trailing bytes, you can find a 1-byte or 2-byte character within the trailing bytes of 3-byte characters. This makes it not "UTF-8-like" by the definition in the OP.
dan04
+1  A: 

I did the math, and it's not possible (if you want to stay strictly "UTF-8-like").

To start off, the four-byte range of UTF-8 covers U+010000 to U+10FFFF, which is a huge slice of the available characters. This is what we're trying to replace using only 3 bytes.

By special-casing each of the 13 unused prefix bytes you mention, you could gain 65,536 characters each, which brings us to a total of 13 * 0x10000, or 0xD0000.

This would bring the total 3-byte character range to U+010000 through U+0DFFFF, which is almost all of it, but not quite enough (see the tally below).
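A quick tally of the shortfall (Python):

    gained = 13 * 0x10000            # 13 freed prefix bytes, 65,536 each
    needed = 0x10FFFF - 0x10000 + 1  # code points above the BMP
    print(hex(needed - gained))      # 0x30000: 196,608 code points short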

zildjohn01