views: 240

answers: 6

What are the typical average bytes-per-character rates for different Unicode encodings in different languages?

E.g. if I wanted the smallest number of bytes to encode some English text, then on average UTF-8 would be 1 byte per character and UTF-16 would be 2, so I'd pick UTF-8.

If I wanted some Korean text, then UTF-16 might average about 2 bytes per character while UTF-8 might average about 3 (I don't know, I'm just making up some illustrative numbers here).

Which encodings yield the smallest storage requirements for different languages and character sets?
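
For reference, here's a quick Python sketch I'd use to measure this directly rather than estimate it (the sample strings below are just placeholders I made up):

    # Measure average bytes per character for a few sample strings.
    samples = {
        "English": "The quick brown fox jumps over the lazy dog",
        "Korean": "다람쥐 헌 쳇바퀴에 타고파",
        "Russian": "Съешь же ещё этих мягких французских булок",
    }

    for language, text in samples.items():
        for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
            size = len(text.encode(encoding))
            print(f"{language:8} {encoding:10} {size / len(text):.2f} bytes/char")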

+1  A: 

For any given language, your bytes-per-character rate is fairly constant, because most languages are allocated to contiguous blocks of code points. The big exception is accented Latin characters, which sit higher in the code space than the unaccented forms. I don't have hard numbers for those.

For languages with contiguous character allocation, there is a table with detailed numbers for various languages on Wikipedia. In general, UTF-8 works well for most small character sets (except those allocated in the higher blocks), and UTF-16 is great for two-byte character sets.

If you need denser compression, you may also want to look at Unicode Technical Note 14, which compares some special-purpose encodings designed to reduce data size for a variety of languages. But these techniques aren't especially common.

emk
A: 

UTF-8

There is a very good article about unicode on JoelOnSoftware:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

nruessmann
+1  A: 

UTF-8 is best for any character set whose characters fall primarily below U+0800; otherwise UTF-16.

That is, UTF-8 for Latin, Greek, Cyrillic, Hebrew, Arabic and a few others. In the non-Latin scripts among those, characters take up the same space as they would in UTF-16, but you save bytes on punctuation and spacing, as the sketch below shows.
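
A rough Python illustration of that U+0800 threshold (the example characters are just ones picked to show each range):

    # Bytes per character in UTF-8 vs UTF-16 for characters from a few scripts.
    chars = {
        "Latin 'a' (U+0061)": "a",
        "Greek alpha (U+03B1)": "\u03b1",
        "Hebrew alef (U+05D0)": "\u05d0",
        "Devanagari a (U+0905)": "\u0905",   # above U+0800
        "CJK ideograph (U+4E2D)": "\u4e2d",  # above U+0800
    }

    for name, ch in chars.items():
        print(f"{name:25} UTF-8: {len(ch.encode('utf-8'))} byte(s), "
              f"UTF-16: {len(ch.encode('utf-16-le'))} byte(s)")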

A: 

If you're really worried about string/character size, have you thought about compressing them? That would automatically reduce the string to its 'minimal' encoding. It's an extra layer of headache, especially if you want to do it in memory, and there are plenty of cases in which it wouldn't buy you anything, but encodings tend to be too general-purpose for the level of compactness you seem to be aiming for.
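
As a rough Python sketch of the idea (the sample string is just a placeholder; note that very short strings often come out bigger after compression):

    import zlib

    # Compare the raw UTF-8 size with the zlib-compressed size.
    text = "다람쥐 헌 쳇바퀴에 타고파 " * 20
    raw = text.encode("utf-8")
    packed = zlib.compress(raw, 9)

    print(f"UTF-8:        {len(raw)} bytes")
    print(f"UTF-8 + zlib: {len(packed)} bytes")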

sblundy
+1  A: 

In UTF-16, all the languages that matter (i.e. anything but Klingon, Elvish and other oddities) will be encoded into 2-byte characters.

So the question becomes: for which languages will characters be 1 or 2 bytes long in UTF-8?

On the Wikipedia page for UTF-8 (http://en.wikipedia.org/wiki/Utf-8), we see that a character with a Unicode code point of U+0800 or above will be at least 3 bytes long in UTF-8.

Knowing that, you just need to look at the Unicode code charts (http://www.unicode.org/charts/) to find the languages that meet your requirement.

:-)
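
As a rough Python sketch of that check (the sample strings are just illustrative):

    # Check whether every character in a text stays below U+0800,
    # i.e. whether it encodes to at most 2 bytes per character in UTF-8.
    def fits_in_two_bytes(text: str) -> bool:
        return all(ord(ch) < 0x0800 for ch in text)

    print(fits_in_two_bytes("Ελληνικά και кириллица"))  # True: Greek and Cyrillic
    print(fits_in_two_bytes("日本語のテキスト"))            # False: CJK is above U+0800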

Now, note that, depending on the framework you're using, the choice may well not be yours to make:

  • In the Windows API, Unicode is handled through wchar_t characters, as UTF-16
  • On Linux, Unicode is usually handled through char, as UTF-8
  • Java is internally UTF-16, as are most compliant XML parsers
  • I was told (at some tech meeting I wasn't paying attention to... sorry...) that UTF-8 is the encoding of choice for databases.

So, pick your poison...

:-)

paercebal
A: 

I don't know exact figures, but for Japanese, Shift_JIS averages fewer bytes per character than UTF-8, and so does EUC-JP, since they're optimised for Japanese text. However, they don't cover the same space of code points as Unicode, so they might not be correct answers to your question.

UTF-16 is better than UTF-8 for Japanese characters (2 bytes per character as opposed to 3), but worse than UTF-8 if there are a lot of 7-bit characters. It depends on the context - technical text is more likely to contain a lot of characters in the 1-byte range. A classical Japanese text might not have any.
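
Here's a rough Python sketch of that comparison; the sample sentence is just one I made up for illustration, and the ratios depend entirely on the text:

    # Compare legacy Japanese encodings with the Unicode encodings
    # for one mixed Japanese/ASCII sentence.
    text = "技術的な文書には ASCII の識別子が多く含まれることがあります。"

    for encoding in ("shift_jis", "euc-jp", "utf-8", "utf-16-le"):
        size = len(text.encode(encoding))
        print(f"{encoding:10} {size:3} bytes ({size / len(text):.2f} bytes/char)")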

Note that for transport, the encoding doesn't matter much if you can zip (gzip, bz2) the data. Code points for an alphabet in Unicode are close together, so you'd expect common prefixes with very short representations in the compressed data.

UTF-8 is usually good for representation in memory, since it's often more compact than UTF-32 or UTF-16, and is compatible with functions on char* which 'expect' ASCII or ISO-8859-1 NUL-terminated strings. It's useless if you need random access to characters by index, though.

If you don't care about non-BMP characters, UCS-2 is always 2 bytes per character and so offers random access. But that depends on what you mean by 'Unicode'.

Steve Jessop