views: 850
answers: 7

If I understand correctly, UTF-32 can handle every character in the universe. So can UTF-16, through the use of surrogate pairs. So is there any good reason to use UTF-32 instead of UTF-16?

+2  A: 

Short answer: no.

Longer answer: yes, for compatibility with other things that didn't get the memo.

Less sarcastic answer: when you care more about speed of indexing than about space usage, or as an intermediate format of some sort, or on machines where alignment issues are more important than cache issues, or...

MarkusQ
+6  A: 

In UTF-32, a Unicode character is always represented by 4 bytes, so parsing code is easier to write than for a UTF-16 string, where a character is represented by a varying number of bytes. On the downside, a UTF-32 character always requires 4 bytes, which can be wasteful if you are working mostly with, say, English characters. So it's a design choice, depending on your requirements, whether to use UTF-16 or UTF-32.
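
For instance, here is a minimal sketch in C of counting code points in each encoding (the function names are mine, and input validation is omitted); it shows the difference in parsing complexity:

    #include <stddef.h>
    #include <stdint.h>

    /* UTF-32: every 32-bit code unit is exactly one code point. */
    size_t count_utf32(const uint32_t *s, size_t units) {
        (void)s;     /* the content doesn't matter, only the length */
        return units;
    }

    /* UTF-16: a code point is one unit, or two for a surrogate pair. */
    size_t count_utf16(const uint16_t *s, size_t units) {
        size_t count = 0;
        for (size_t i = 0; i < units; ++i) {
            /* 0xD800-0xDBFF is a high surrogate; skip its partner. */
            if (s[i] >= 0xD800 && s[i] <= 0xDBFF)
                ++i;
            ++count;
        }
        return count;
    }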

Raminder
Actually, UTF-32 is wasteful for most texts, not just for English ones: most living languages have all (or at least most) of their characters well within the range that doesn't require surrogate pairs in UTF-16.
Joachim Sauer
+2  A: 

UTF-8 can also represent any Unicode character!

If your text is mostly English, you can save a lot of space by using UTF-8, but indexing characters is not O(1), because some characters take up more than one byte.

If space is not as important to your situation as speed is, UTF-32 would suit you better, because indexing is O(1).

UTF-16 can be better than UTF-8 for non-English text, because in UTF-8 some characters take up 3 bytes, whereas in UTF-16 they'd only take up two.
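
As an illustration of why UTF-8 indexing is O(n), here is a minimal sketch in C (the function name is mine, and the input is assumed to be valid UTF-8):

    #include <stddef.h>
    #include <stdint.h>

    /* Return a pointer to the start of the nth code point in a
     * NUL-terminated UTF-8 string, or NULL if the string is shorter.
     * Continuation bytes look like 10xxxxxx, so we only count lead
     * bytes -- which forces a walk from the start: O(n). */
    const uint8_t *utf8_index(const uint8_t *s, size_t n) {
        for (; *s; ++s) {
            if ((*s & 0xC0) != 0x80) {  /* lead byte: new code point */
                if (n == 0)
                    return s;
                --n;
            }
        }
        return NULL;
    }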

hasen j
Apparently UTF-32 is programmatically faster, even though you would save a lot of space using UTF-8, because it can be processed using a more efficient word size (i.e. 32 bits, rather than handling each 8-bit chunk at a time), though with a (substantially) more complex UTF-8 library that's a non-issue.
Arafangion
+3  A: 

Someone might prefer UTF-32 over UTF-16 because dealing with surrogate pairs is pretty much always special-case handling, and those special cases are places where bugs can creep in, because you handle them incorrectly (or, more likely, forget to handle them at all).

If the increased memory usage of UTF-32 is not an issue, the reduced complexity might be enough of an advantage to choose it.
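
To make the special case concrete, here is a minimal UTF-16 decoding sketch in C (the function name is mine, and validation of unpaired surrogates is omitted); the surrogate branch is exactly the code path that is easy to get wrong or forget:

    #include <stddef.h>
    #include <stdint.h>

    /* Decode one code point from UTF-16, advancing *i by one code
     * unit, or by two when the unit starts a surrogate pair. */
    uint32_t utf16_decode(const uint16_t *s, size_t *i) {
        uint16_t u = s[(*i)++];
        if (u >= 0xD800 && u <= 0xDBFF) {         /* high surrogate */
            uint16_t lo = s[(*i)++];              /* low surrogate  */
            return 0x10000
                 + (((uint32_t)(u - 0xD800) << 10) | (lo - 0xDC00));
        }
        return u;   /* BMP code point: the common, non-special case */
    }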

Michael Burr
+3  A: 

There are probably a few good reasons, but one would be to speed up indexing/searching, e.g. in databases and the like.

With UTF-32 you know that each character is 4 bytes. With UTF-16 you don't know how many bytes any particular character will take.

For example, suppose you have a function that returns the nth character of a string:

uint32_t getChar(size_t index, const char *s);

If you are coding in a language that has direct memory access, say C, then in UTF-32 this function may be as simple as some pointer arithmetic (s + (4 * index)), which is O(1).

If you are using UTF-16, though, you would have to walk the string, decoding as you went, which is O(n).
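
A rough sketch of the two variants (the names utf32_char_at and utf16_char_at are mine, and bounds checks are omitted):

    #include <stddef.h>
    #include <stdint.h>

    /* UTF-32: the nth code point is plain pointer arithmetic, O(1). */
    uint32_t utf32_char_at(const uint32_t *s, size_t n) {
        return s[n];          /* i.e. *(s + n), or s + 4*n in bytes */
    }

    /* UTF-16: walk and decode from the start, O(n). */
    uint32_t utf16_char_at(const uint16_t *s, size_t n) {
        size_t i = 0;
        while (n--)   /* skip 2 units for a surrogate pair, else 1 */
            i += (s[i] >= 0xD800 && s[i] <= 0xDBFF) ? 2 : 1;
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF)
            return 0x10000 + (((uint32_t)(s[i] - 0xD800) << 10)
                              | (s[i + 1] - 0xDC00));
        return s[i];
    }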

SCdF
+2  A: 

Here is a good document from the Unicode Consortium, too:

Comparison of the Advantages of UTF-32, UTF-16, and UTF-8

c411
+1  A: 

In general, you just use the string datatype/encoding of the underlying platform, which is often (Windows, Java, Cocoa...) UTF-16 and sometimes UTF-8 or UTF-32. This is mostly for historical reasons; there is little difference between the three Unicode encodings: all three are well-defined, fast and robust, and all of them can encode every Unicode code point sequence.

The unique feature of UTF-32, namely that it is a fixed-width encoding (each code point is represented by exactly one code unit), is of little use in practice: your memory management layer needs to know about the number and width of code units, and users are interested in abstract characters and graphemes. As the Unicode standard mentions, Unicode applications have to deal with combining characters, ligatures and so on anyway, and handling surrogate pairs, despite being conceptually different, can be done within the same technical framework.
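
To illustrate why fixed width buys less than it seems, here is a small C11 sketch (my own example, using <uchar.h>) showing that a single user-perceived character can span several code points even in UTF-32:

    #include <stdio.h>
    #include <uchar.h>   /* char32_t, C11 */

    int main(void) {
        /* "é" precomposed vs. "e" + combining acute accent: */
        char32_t precomposed[] = U"\u00E9";    /* 1 code point  */
        char32_t combining[]   = U"e\u0301";   /* 2 code points */
        printf("precomposed: %zu code point(s)\n",
               sizeof precomposed / sizeof *precomposed - 1);
        printf("combining:   %zu code point(s)\n",
               sizeof combining / sizeof *combining - 1);
        return 0;
    }

Both spell the same user-perceived character, so even with one code unit per code point, grapheme-aware processing is still needed.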

If I were to reinvent the world, I'd probably go for UTF-32 because it is simply the least complex encoding, but as it stands the differences are too small to be of practical concern.

Philipp