What is the difference between UTF and UCS?

What are the best ways to represent non-European character sets (using UTF) in C++ strings? I would like to know your recommendations for:

  • Internal representation inside the code
    • For string manipulation at run-time
    • For using the string for display purposes.
  • Best storage representation (i.e. in a file)
  • Best on-wire transport format (transfer between applications that may be on different architectures and have different standard locales)
A: 

UTC is Coordinated Universal Time, not a character set (I didn't find any charset called UTC).

For internal representation, you may want to use wchar_t for each character, and std::wstring for strings. They use exactly 2 bytes for each character, so seeking and random access will be fast.
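A minimal sketch of that usage (the string content and variable names are only illustrative; note that the size of wchar_t actually varies by platform):

    #include <string>
    #include <iostream>

    int main() {
        // One wchar_t per element (2 bytes on Windows, 4 on most UNIX systems).
        std::wstring name = L"\u00C9l\u00E8ve";   // "Élève" written with \u escapes

        wchar_t first = name[0];                  // O(1) random access by index
        std::wcout << name.size() << L" code units, first = U+"
                   << std::hex << static_cast<unsigned long>(first) << L"\n";
        return 0;
    }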

For storage, if most of the data are not ASCII (i.e. code points >= 128), you may want to use UTF-16, which is almost the same as a serialized wstring of wchar_t.

Since UTF-16 can be little endian or big endian, for wire transport, try to convert it to UTF-8, which is architecture-independent.
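A sketch of that conversion using the standard &lt;codecvt&gt; facility (available since C++11, deprecated in C++17; the string content is arbitrary):

    #include <codecvt>
    #include <locale>
    #include <string>
    #include <iostream>

    int main() {
        std::wstring internal = L"caf\u00E9";    // internal wide-string form

        // Convert to UTF-8 before putting the text on the wire; UTF-8 has no
        // byte-order (endianness) ambiguity.
        std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
        std::string utf8 = conv.to_bytes(internal);

        std::cout << utf8.size() << " bytes on the wire\n";   // 5: 'c' 'a' 'f' + 2-byte é
        return 0;
    }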

yuku
The size of wchar_t (and therefore of wstring internally) is not defined; I have seen both 2- and 4-byte versions. Why UTF-16 for storage but UTF-8 for the wire? Files may also be saved on one machine and loaded on another. I want to understand why you made the choice, as well as the choice itself.
Martin York
http://en.wikipedia.org/wiki/Universal_Character_Set
Jason Dagit
@Martin: UTF-16 cannot be processed by existing ASCII-oriented tools because many bytes are 0, which makes per-byte functions believe the NULL terminator has been reached.
John Millikin
+2  A: 

I would suggest:

  • For representation in code, wchar_t or equivalent.
  • For storage representation, UTF-8.
  • For wire representation, UTF-8.

The advantage of UTF-8 in storage and wire situations is that machine endianness is not a factor. The advantage of using a fixed-size character such as wchar_t in code is that you can easily find out the length of a string without having to scan it.
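A rough illustration of that trade-off (a sketch; the function names are illustrative, the UTF-8 counter assumes well-formed input, and the fixed-width claim assumes one wchar_t per character):

    #include <cstddef>
    #include <string>
    #include <iostream>

    // With a fixed-size character type, the length is just the element count.
    std::size_t length_fixed(const std::wstring& s) {
        return s.size();               // O(1); true for BMP-only text
    }

    // With UTF-8 you must scan and skip continuation bytes (10xxxxxx).
    std::size_t length_utf8(const std::string& s) {
        std::size_t n = 0;
        for (unsigned char c : s)
            if ((c & 0xC0) != 0x80)    // count only lead bytes
                ++n;
        return n;
    }

    int main() {
        std::cout << length_fixed(L"caf\u00E9") << "\n";   // 4
        std::cout << length_utf8("caf\xC3\xA9") << "\n";   // 4 (from 5 bytes)
        return 0;
    }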

Greg Hewgill
wchar_t: But what encoding? Are you suggesting UTF-16 internally?
Martin York
On many Unix platforms, wchar_t is 32 bits, so this is easy. On platforms where wchar_t is 16 bits, yes, UTF-16 would be the way to go.
Chris Jester-Young
Martin: I rolled back your edit because using wchar_t does not imply UTF-16 -- in UNIX, sizeof(wchar_t) == 4.
John Millikin
Fair enough, that was a bad edit. But a wchar_t can hold a UTF-16 code point, and wchar_t has no implied representation, so you could hold any encoding in it (size permitting). So what I am looking for is: how should I store a string internally for manipulation and display purposes?
Martin York
See my answer: use whatever is used on your platform. Windows: UTF-16. UNIX: UCS-4. The data type used is incidental, they're all just typedefs anyway.
John Millikin
Using UTF-32 as internal storage (as on some Unix flavors) is a horrible waste of memory, and it is not recommended by the Unicode Standard.
Nemanja Trifunovic
+4  A: 

Have you read Joel Spolsky's article on The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)?

Michael Burr
That and much more. But I want the opinions of more than one person to get a sense of what is happening in industry code.
Martin York
Yeah, a very good read for everyone.
Konstantinos
+8  A: 

What is the difference between UTF and UCS?

UCS encodings are fixed width, and are marked by how many bytes are used for each character. For example, UCS-2 requires 2 bytes per character. Characters with code points outside the available range can't be encoded in a UCS encoding.

UTF encodings are variable width, and marked by the minimum number of bits to store a character. For example, UTF-16 requires at least 16 bits (2 bytes) per character. Characters with large code points are encoded using a larger number of bytes -- 4 bytes for astral characters in UTF-16.
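For example, a code point outside the BMP cannot be represented in UCS-2 at all, while UTF-16 encodes it as a surrogate pair of two 16-bit units; a sketch of the standard encoding arithmetic:

    #include <cstdint>
    #include <cstdio>

    int main() {
        std::uint32_t cp = 0x1F600;                // an astral code point (above U+FFFF)

        // UTF-16 surrogate pair construction as defined by the Unicode Standard.
        std::uint32_t v    = cp - 0x10000;
        unsigned      high = 0xD800 + (v >> 10);   // high (lead) surrogate
        unsigned      low  = 0xDC00 + (v & 0x3FF); // low (trail) surrogate

        std::printf("U+%X -> %04X %04X\n", (unsigned)cp, high, low);  // U+1F600 -> D83D DE00
        return 0;
    }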

  • Internal representation inside the code
  • Best storage representation (i.e. in a file)
  • Best on-wire transport format (transfer between applications that may be on different architectures and have different standard locales)

For modern systems, the most reasonable storage and transport encoding is UTF-8. There are special cases where others might be appropriate -- UTF-7 for old mail servers, UTF-16 for poorly-written text editors -- but UTF-8 is most common.

Preferred internal representation will depend on your platform. In Windows, it is UTF-16. In UNIX, it is UCS-4. Each has its good points:

  • UTF-16 strings never use more memory than UCS-4 strings. If you store many large strings with characters primarily in the Basic Multilingual Plane (BMP), UTF-16 will require much less space than UCS-4. Outside the BMP, it will use the same amount.
  • UCS-4 is easier to reason about. Because a UTF-16 character outside the BMP is split across a "surrogate pair" of two code units, it can be challenging to correctly split or render a string (see the sketch after this list). UCS-4 text does not have this issue. UCS-4 also acts much like ASCII text in "char" arrays, so existing text algorithms can be ported easily.
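A sketch of the kind of special handling surrogate pairs force on UTF-16 code (with UCS-4 the loop would simply advance one unit per character); the container, function name, and values are only illustrative:

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Count code points in a UTF-16 sequence, stepping over surrogate pairs.
    std::size_t count_code_points(const std::vector<std::uint16_t>& units) {
        std::size_t count = 0;
        for (std::size_t i = 0; i < units.size(); ++count) {
            // A high surrogate (D800..DBFF) means the character uses two units.
            bool pair = units[i] >= 0xD800 && units[i] <= 0xDBFF;
            i += pair ? 2 : 1;
        }
        return count;
    }

    int main() {
        // 'A', U+1F600 (surrogate pair D83D DE00), 'B'
        std::vector<std::uint16_t> text = { 0x0041, 0xD83D, 0xDE00, 0x0042 };
        std::printf("%zu code points in %zu code units\n",
                    count_code_points(text), text.size());   // 3 in 4
        return 0;
    }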

Finally, some systems use UTF-8 as an internal format. This is good if you need to interoperate with existing ASCII- or ISO-8859-based systems, because NULL bytes are not present in the middle of UTF-8 text -- whereas they are in UTF-16 or UCS-4.
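A small demonstration of that property (a sketch, with arbitrary text): the UTF-8 form contains no zero bytes, so byte-oriented routines such as strlen still see the whole string, whereas a UTF-16 form stops them at the first embedded zero.

    #include <cstring>
    #include <cstdio>

    int main() {
        // "café" as UTF-8: every byte is non-zero, so strlen walks the whole string.
        const char utf8[] = "caf\xC3\xA9";
        std::printf("strlen(utf8)    = %zu\n", std::strlen(utf8));     // 5

        // The same text as UTF-16LE bytes: each ASCII character carries a 0x00 byte,
        // so a byte-oriented routine stops after the first character.
        const char utf16le[] = { 'c', 0, 'a', 0, 'f', 0, '\xE9', 0, 0, 0 };
        std::printf("strlen(utf16le) = %zu\n", std::strlen(utf16le));  // 1
        return 0;
    }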

John Millikin
No, UTF encodings are not always variable-width (think of UTF-32, for instance).
bortzmeyer
UTF-32 may use a fixed width for each code point, but I *think* you can still have multiple code points that represent one complete character/glyph (when combining characters are involved), and you need to accept those and normalize them. If so, UTF-32 is not really much better than UTF-16.
Shadow2531
@bortzmeyer: UTF-32 is really just UCS-4 with a few extra restrictions. Honestly, I've never seen UTF-32 used *anywhere*, so I tend to just ignore it.
John Millikin
Shadow: Combining characters are not really an issue unless you're writing a text renderer, while UTF-16's surrogate pairs mean special processing code is required for *all* characters.
John Millikin
@John Millikin: I believe UTF-32 is used in most modern Linux systems (as opposed to the UTF-16 used in Windows).
Head Geek
@Head Geek: Linux might support UTF-32 but it uses UTF-8 natively (as opposed to the native UTF-16 use in Windows). http://www.cl.cam.ac.uk/~mgk25/unicode.html
sean e
A: 

For internal representation inside the code, you'd better do this for both European and non-European characters:

\uNNNN

Characters in the range \u0020 to \u007E, and a little bit of whitespace (e.g. end of line), can be written as ordinary characters. For anything above \u0080, if you write it as an ordinary character it will compile only under your code page (e.g. OK in France but breaking in Russia, OK in Russia but breaking in Japan, OK in China but breaking in the US, etc.).
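For example (a sketch; the literal text is arbitrary):

    #include <string>
    #include <iostream>

    int main() {
        // Portable across compiler code pages: non-ASCII characters are written
        // as \uNNNN universal character names rather than raw bytes.
        std::wstring greeting = L"\u041F\u0440\u0438\u0432\u0435\u0442, \u043C\u0438\u0440!";  // "Привет, мир!"

        std::wcout << greeting.size() << L" code units\n";
        return 0;
    }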

Windows programmer
A: 

See Chapter 5 of the Unicode Standard.

Nemanja Trifunovic