views:

3346

answers:

16

I'm going to ask what is probably quite a controversial question: "Should one of the most popular encodings, UTF-16, be considered harmful?"

Why do I ask this question?

How many programmers are aware of the fact that UTF-16 is actually a variable-length encoding? By this I mean that there are code points that, represented as surrogate pairs, take more than one 16-bit code unit.
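
For instance, here is a minimal Java sketch (purely illustrative, not tied to any particular application) showing a single character outside the BMP occupying two UTF-16 code units:

    public class SurrogatePairDemo {
        public static void main(String[] args) {
            String s = "\uD834\uDD1E"; // U+1D11E MUSICAL SYMBOL G CLEF, a single character
            System.out.println(s.length());                      // 2 -- UTF-16 code units
            System.out.println(s.codePointCount(0, s.length())); // 1 -- actual code point
            System.out.println(s.codePointAt(0) == 0x1D11E);     // true
        }
    }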

I know; lots of applications, frameworks and APIs use UTF-16, such as Java's String, C#'s String, Win32 APIs, Qt GUI libraries, the ICU Unicode library, etc. However, with all of that, there are lots of basic bugs in the processing of characters outside the BMP (characters that must be encoded using two UTF-16 code units).

For example, try to edit one of these characters:

  • 𝄞
  • 𝕥
  • 𝟶
  • 𠂊

You may miss some, depending on what fonts you have installed. These characters are all outside of the BMP (Basic Multilingual Plane). If you cannot see these characters, you can also try looking at them in the Unicode Character reference.

For example, try to create file names in Windows that include these characters; try to delete these characters with a "backspace" to see how they behave in different applications that use UTF-16. I did some tests and the results are quite bad:

  • Opera has problems editing them (deleting requires 2 presses of backspace)
  • Notepad can't deal with them correctly (deleting requires 2 presses of backspace)
  • File-name editing in Windows dialogs is broken (deleting requires 2 presses of backspace)
  • All Qt 3 applications can't deal with them - they show two empty squares instead of one symbol.
  • Python encodes such characters incorrectly when they are used directly: u'X'!=unicode('X','utf-16') on some platforms when X is a character outside the BMP.
  • Python 2.5's unicodedata fails to get properties of such characters when Python is compiled with UTF-16 Unicode strings.
  • Stack Overflow seems to remove these characters from the text if they are edited directly as Unicode characters (they are shown here using HTML Unicode escapes).

It seems that such bugs are extremely easy to find in many applications that use UTF-16.

So... Do you think that UTF-16 should be considered harmful?

+11  A: 

I would suggest that thinking UTF-16 might be considered harmful says that you need to gain a greater understanding of Unicode.

Since I've been downvoted for presenting my opinion on a subjective question, let me elaborate. What exactly is it that bothers you about UTF-16? Would you prefer it if everything were encoded in UTF-8? UTF-7? Or how about UCS-4? Of course certain applications are not designed to handle every single character code out there - but these encodings are necessary, especially in today's global information domain, for communication across international boundaries.

But really, if you feel UTF-16 should be considered harmful because it's confusing or can be improperly implemented (Unicode certainly can be), then what method of character encoding would be considered non-harmful?

EDIT: To clarify: why consider improper implementations of a standard a reflection of the quality of the standard itself? As others have subsequently noted, merely because an application uses a tool inappropriately does not mean that the tool itself is defective. If that were the case, we could probably say things like "var keyword considered harmful", or "threading considered harmful". I think the question confuses the quality and nature of the standard with the difficulties many programmers have in implementing and using it properly, which I feel stem more from their lack of understanding of how Unicode works than from Unicode itself.

patjbs
-1: How about addressing some of Artyom's objections, rather than just patronising him?
RichieHindle
BTW: When I started writing this article I almost wanted to title it "Should the Joel on Software article on Unicode be considered harmful?", because there are **many** mistakes. For example: UTF-8 encoding takes up to 4 bytes and not 6. Also, it does not distinguish between UCS-2 and UTF-16, which are really different -- and actually cause the problems I talk about.
Artyom
My point is that those code points are designed and implemented for specific tasks. The "bugs" you describe are no different than the "bugs" one would encounter if you attempted to give input outside the scope of any application.
patjbs
I agree with the last edit. The simplest example: we still use C and C++ even though both languages use pointers and thus are not safe.
Malcolm
Also, it should be noted that when Joel wrote that article, the UTF-8 standard allowed sequences of up to 6 bytes, not 4. RFC 3629 changed the standard to 4 bytes several months AFTER he wrote the article. Like most anything on the internet, it pays to read from more than one source, and to be aware of the age of your sources. The link wasn't intended to be the "end all be all", but rather a starting point.
patjbs
Actually, the problem is not with the standard. It is 100% OK. In fact, there are good implementations that work with UTF-16: ICU, Java Swing etc. But the problem is that there are too many **basic** bugs in the processing of surrogate pairs when working with UTF-16, so you should probably never pick UTF-16 as the internal encoding of new applications... Because there are **lots** of real-life examples where UTF-16's nature causes big troubles: even Stack Overflow can't deal with them.
Artyom
Not to try and flog a dead horse here, but if you shouldn't pick UTF-16 as the reasonable standard, what should you pick? I'm interested in your perspective on what an acceptable alternative would be. For instance, a lot of my work involves working with ancient languages (Greek, Aramaic, Hebrew, Syriac, etc.), and I work a lot with these oddball Unicode characters, so I'm constantly having to transition documents between UTF-8, 16 and 32.
patjbs
I would pick UTF-8 or UTF-32, which are, respectively, a variable-length encoding in almost all cases (including the BMP) and a fixed-length encoding always.
Artyom
Artyom, SO doesn't NEED to use UTF-16, since UTF-8 is the de facto standard for storage and communication of text, while UTF-16 is the de facto standard for processing of text. I don't know of any web page using UTF-16, and it would be a really bold move to do so, especially since a really popular language has no Unicode support: PHP (and UTF-16 isn't really easy to deal with; UTF-8 is the standard encoding in most Linux installs, where PHP is commonly run).
iconiK
+16  A: 

There is nothing wrong with UTF-16 encoding. But languages that treat the 16-bit units as characters should probably be considered badly designed. Having a type named 'char' which does not always represent a character is pretty confusing. Since most developers will expect a char type to represent a code point or character, much code will probably break when exposed to characters beyond the BMP.

Note however that even using UTF-32 does not mean that each 32-bit code point will always represent a character. Due to combining characters, an actual character may consist of several code points. Unicode is never trivial.

BTW, there is probably the same class of bugs with platforms and applications which expect characters to be 8-bit, which are fed UTF-8.

JacquesB
In Java's case, if you look at their timeline (http://www.java.com/en/javahistory/timeline.jsp), you see that the primary development of String happened while Unicode was 16 bits (it changed in 1996). They had to bolt on the ability to handle non-BMP code points, thus the confusion.
Kathy Van Stone
@Kathy: Not really an excuse for C#, though. Generally, I agree that there should be a `CodePoint` type, holding a single code point (21 bits), a `CodeUnit` type, holding a single code unit (16 bits for UTF-16), and a `Character` type would ideally have to support a complete grapheme. But that makes it functionally equivalent to a `String` ...
Joey
+9  A: 

Well, there is an encoding that uses fixed-size symbols. I certainly mean UTF-32. But 4 bytes for each symbol is too much wasted space; why would we use it in everyday situations?

Actually I don't understand why it's such a big deal anyway. Characters outside the BMP are encountered only in very specific cases and areas. Most programs that use UTF-16 are not intended for working with texts containing such characters, so why bother with support for what won't be used anyway?

I don't think it should be considered harmful, but on the other hand it doesn't mean developers shouldn't be mindful. Use what is needed where it is needed. And this is exactly my point: if you use mostly English, use UTF-8, if you use mostly Cyrillic or Japanese, use UTF-16, if you use ancient languages, use UTF-32. No harm in using the most appropriate method for what you do, just do it properly, of course.

Malcolm
If a program uses UTF-16, shouldn't it be used "correctly"?
Albert
Certainly. But that doesn't mean that if someone can use something incorrectly, we shouldn't use it at all, right?
Malcolm
That's a rather blinkered, Anglo-centric view, Malcolm. Almost on a par with "ASCII is good enough for the USA - the rest of the world should fit in with us".
Jonathan Leffler
Actually I'm from Russia and encounter Cyrillic all the time (including in my own programs), so I don't think that I have an Anglo-centric view. :) Mentioning ASCII is not quite appropriate, because it's not Unicode and doesn't support specific characters. UTF-8, UTF-16, UTF-32 support the very same international character sets, they are just intended for use in their specific areas. And this is exactly my point: if you use mostly English, use UTF-8, if you use mostly Cyrillic, use UTF-16, if you use ancient languages, use UTF-32. Quite simple.
Malcolm
But you might not know in advance whether your application needs to handle characters outside the BMP if the application accepts data like names. For example, some Asian names might be written with characters outside of the BMP.
JacquesB
Not true; Asian scripts like Japanese, Chinese or Arabic belong to the BMP also. The BMP itself is actually very large and certainly large enough to include all the scripts used nowadays; it's not like it includes only European scripts or something. No, if you are really going to encounter non-BMP characters, you'll almost definitely know it.
Malcolm
@Malcolm: The issue is more complex than that. See eg. http://www.jbrowse.com/text/unij.html
JacquesB
And what did I write that was wrong? Plane 2 contains only rare or historic symbols, and all other characters fit into the BMP and thus don't need surrogate pairs.
Malcolm
@Malcolm: The issue is that some people apparently have names containing these rare symbols, even though they do not otherwise occur in regular language.
JacquesB
There is, but it's not really a problem specific to Unicode, since standard encodings also don't include these characters. People use homophones and other ways to write such names, and that can be done in any encoding, including Unicode. There are probably serious difficulties even with inputting rare symbols, so the situation doesn't happen all of a sudden, and users won't be surprised to find that a program refuses to accept them correctly, if it does.
Malcolm
"Not true, Asian scripts like Japanese, Chinese or Arabic belong to BMP also. BMP itself is actually very large and certainly large enough to include all the scripts used nowadays"This is all so wrong. BMP contains 0xFFFF characters (65536). Chinese alone has more than that. Chinese standards (GB 18030) has more than that. Unicode 5.1 already allocated more than 100,000 characters.
Mihai Nita
It does, but characters outside the BMP are not for everyday use; they can be used, for example, for old texts or to write names with rare ideographs in them. And all characters that are commonly used fit into the BMP.
Malcolm
@Marcolm: "BMP itself is actually very large and certainly large enough to include all the scripts used nowadays"Not true.At this point Unicode already allocated about 100K characters, way more than BMP can accomodate.There are big chunks of Chinese characters outside BMP. And some of them are required by GB-18030 (mandatory Chinese standard). Other are required by (non-mandatory) Japanese and Korean standards.So if you try to sell anything in those markets, you need beyond BMP support.
Mihai Nita
If the BMP is *that* far from having enough capacity to write normally in Chinese, how do they manage to write in such encodings as GBK or GB 2312? It is clear that support for the other planes would be useful, but nonetheless.
Malcolm
All the currently used languages in the world fit in the BMP, in 64k code points. Anything outside of the BMP is not for current use of the language; it's for old characters, for old languages, for exotic characters, or even Klingon. If Chinese and/or Japanese and/or Korean need characters outside of the BMP, how did they handle this before Unicode was widely adopted? Nearly all the encodings used in Asia were variable-length, using 8 or 16 bits per character.
iconiK
+8  A: 

My personal choice is to always use UTF-8. It's the standard on Linux for nearly everything. It's backwards compatible with many legacy apps. There is a very minimal overhead in terms of extra space used for non-Latin characters vs the other UTF formats, and there is a significant saving in space for Latin characters. On the web, Latin languages reign supreme, and I think they will for the foreseeable future. And to address one of the main arguments in the original post: nearly every programmer is aware that UTF-8 will sometimes have multi-byte characters in it. Not everyone deals with this correctly, but they are usually aware, which is more than can be said for UTF-16. But, of course, you need to choose the one most appropriate for your application. That's why there's more than one in the first place.
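
As a rough illustration of the size trade-off (a minimal Java sketch; the sample strings are arbitrary):

    import java.nio.charset.StandardCharsets;

    public class EncodingSizes {
        public static void main(String[] args) {
            String latin = "Hello, world";   // ASCII-only text
            String cyrillic = "Привет, мир"; // non-Latin BMP text

            // ASCII text: 1 byte per character in UTF-8, 2 in UTF-16.
            System.out.println(latin.getBytes(StandardCharsets.UTF_8).length);    // 12
            System.out.println(latin.getBytes(StandardCharsets.UTF_16LE).length); // 24

            // Cyrillic text: 2 bytes per letter in both encodings.
            System.out.println(cyrillic.getBytes(StandardCharsets.UTF_8).length);    // 20
            System.out.println(cyrillic.getBytes(StandardCharsets.UTF_16LE).length); // 22
        }
    }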

rmeador
UTF-16 is simpler for anything inside the BMP, which is why it is used so widely. But I'm a fan of UTF-8 too; it also has no problems with byte order, which works to its advantage.
Malcolm
@Malcolm: UTF-16 also has no problems with byte order as it requires a BOM which specifies the order :-)
Joey
Theoretically, yes. In practice there are such things as, say, UTF-16BE, which means UTF-16 in big endian without a BOM. This is not something I made up, this is an actual encoding allowed in ID3v2.4 tags (ID3v2 tags suck, but are, unfortunately, widely used). And in such cases you have to define endianness externally, because the text itself doesn't contain a BOM. UTF-8 is always written one way and it doesn't have such a problem.
Malcolm
A: 

My guesses as to why the Windows API (and presumably the Qt libraries) use UTF-16:

  • UTF-8 wasn't around when these APIs were being developed.
  • The OS needs to do a lookup on the code points to display the glyphs-- if the data is passed around internally as UTF-8, every time it needs to do that for a multibyte character, it would have to convert from UTF-8 to UTF-16/32. If the bytestream is stored as "wide" chars in memory, it won't need to do this conversion. So increased memory usage is a tradeoff for decreased conversion work and complexity.

When writing to a stream, however, it's considered best practice to use UTF-8 for the reasons outlined in the Joel article referenced above.

pjbeardsley
Actually, UTF-8 was developed before UTF-16. In the beginning there was UCS-2, because in those days a Unicode code point **was** at most 16 bits.
Artyom
Actually UTF-8 was around before these APIs were developed too - it was invented in 1992. The very first OS to implement any sort of UCS/Unicode support was Plan9, and it used UTF-8.
R..
+5  A: 

UTF-16 is the best compromise between handling and space and that's why most major platforms (Win32, Java, .NET) use it for internal representation of strings.

Nemanja Trifunovic
-1 because UTF-8 is likely to be smaller or not significantly different. For certain Asian scripts UTF-8 is three bytes per glyph while UTF-16 is only two, but this is balanced by UTF-8 being only one byte for ASCII (which does often appear even within Asian languages in product names, commands and such things). Further, in the said languages, a glyph conveys more information than a Latin character, so it is justified for it to take more space.
Tronic
Thanks for the downvote, but I still don't get which part of the "best compromise between handling and space" you consider wrong. Note the word "compromise". Or maybe you don't believe that Win32, Java and .NET (also ICU, btw) use UTF-16 internally?
Nemanja Trifunovic
I would not call combining the worst sides of both options a good compromise.
Tronic
It is the *best* of both worlds: it is pretty easy to handle, unlike UTF-8, and does not take nearly as much memory as UTF-32.
Nemanja Trifunovic
It's not easier than UTF-8. It's variable-length too.
luiscubal
Leaving debates about the benefits of UTF-16 aside: What you cited is *not* the reason for Windows, Java or .NET using UTF-16. Windows and Java date back to a time when Unicode was a 16-bit encoding. UCS-2 was a reasonable choice back then. When Unicode became a 21-bit encoding, migrating to UTF-16 was the best choice existing platforms had. That had nothing to do with ease of handling or space compromises. It's just a matter of legacy.
Joey
@Johannes: It is a matter of legacy in case of Win32 and Java, but not .NET and especially not Python 3.
Nemanja Trifunovic
.NET inherits the Windows legacy here.
Joey
That's why I said "especially not Python 3", but it would have been perfectly feasible to implement even .NET strings as UTF-8. Of course, interop with Win32 is easier with UTF-16 strings.
Nemanja Trifunovic
+9  A: 

Years of Windows internationalization work, especially in East Asian languages, might have corrupted me, but I lean toward UTF-16 for internal-to-the-program representations of strings, and UTF-8 for network or file storage of plaintext-like documents. UTF-16 can usually be processed faster on Windows, though, so that's the primary benefit of using UTF-16 in Windows.

Making the leap to UTF-16 dramatically improved the adequacy of average products handling international text. There are only a few narrow cases when the surrogate pairs need to be considered (deletions, insertions, and line breaking, basically) and the average-case is mostly straight pass-through. And unlike earlier encodings like JIS variants, UTF-16 limits surrogate pairs to a very narrow range, so the check is really quick and works forward and backward.
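
As an illustration of how narrow that check is (a minimal Java sketch, not code from this answer):

    public class SurrogateCheck {
        // Surrogate code units occupy the fixed range U+D800..U+DFFF, so the test
        // is a constant-time range check that works scanning forward or backward.
        static boolean isHighSurrogate(char c) { return c >= 0xD800 && c <= 0xDBFF; }
        static boolean isLowSurrogate(char c)  { return c >= 0xDC00 && c <= 0xDFFF; }

        public static void main(String[] args) {
            String s = "a\uD834\uDD1Eb"; // "a" + U+1D11E (surrogate pair) + "b"
            for (int i = 0; i < s.length(); i++) {
                char c = s.charAt(i);
                System.out.printf("index %d: U+%04X high=%b low=%b%n",
                        i, (int) c, isHighSurrogate(c), isLowSurrogate(c));
            }
            // The standard library offers the same tests:
            // Character.isHighSurrogate(c) and Character.isLowSurrogate(c).
        }
    }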

Granted, it's roughly as quick in correctly-encoded UTF-8, too. But there are also many broken UTF-8 applications that incorrectly encode surrogate pairs as two separate UTF-8 sequences. So UTF-8 doesn't guarantee salvation either.

IE has handled surrogate pairs reasonably well since 2000 or so, even though it typically converts them from UTF-8 pages to an internal UTF-16 representation; I'm fairly sure Firefox has got it right too, so I don't really care what Opera does.

UTF-32 (aka UCS-4) is pointless for most applications since it's so space-demanding; it's pretty much a nonstarter.

JasonTrue
I didn't quite get your comment on UTF-8 and surrogate pairs. Surrogate pairs are a concept that is only meaningful in the UTF-16 encoding, right? Perhaps code that converts directly from UTF-16 encoding to UTF-8 encoding might get this wrong, and in that case, the problem is incorrectly reading the UTF-16, not writing the UTF-8. Is that right?
Craig McQueen
What Jason's talking about is software that deliberately implements UTF-8 that way: create a surrogate pair, then UTF-8 encode each half separately. The correct name for that encoding is CESU-8, but Oracle (e.g.) misrepresents it as UTF-8. Java employs a similar scheme for object serialization, but it's clearly documented as "Modified UTF-8" and only for internal use. (Now, if we could just get people to READ that documentation and stop using DataInputStream#readUTF() and DataOutputStream#writeUTF() inappropriately...)
Alan Moore
+14  A: 

There is a simple rule of thumb on what Unicode Transformation Format (UTF) to use:

  • UTF-8 for storage and communication
  • UTF-16 for data processing
  • you might go with UTF-32 if most of the platform API you use is UTF-32 (common in the UNIX world)

Most systems today use UTF-16 (Windows, Mac OS, Java, .NET, ICU, Qt). Also see this document: http://unicode.org/notes/tn12/

Back to "UTF-16 as harmful", I would say: definitely not.

People who are afraid of surrogates (thinking that they transform Unicode into a variable-length encoding) don't understand the other (way bigger) complexities that make mapping between characters and Unicode code points very complex: combining characters, ligatures, variation selectors, control characters, etc.

Just read this series here http://blogs.msdn.com/michkap/archive/2009/06/29/9800913.aspx and see how UTF-16 becomes an easy problem.

Mihai Nita
+1  A: 

This totally depends on your application. For most people, UTF-16BE is a good compromise. Other choices either make finding characters too expensive (UTF-8) or waste too much space (UTF-32 or UCS-4, where each character takes 4 bytes).

With UTF-16BE, you can treat it as UCS-2 (fixed length) in most cases. Characters beyond the BMP are rare in normal applications. You still have the option to handle surrogate pairs if you choose to, say, if you are writing an archaeology application.
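
If you do choose to handle them, the walk itself is short (a minimal Java sketch, not code from this answer; it iterates over an in-memory UTF-16 string by code point):

    public class CodePointWalk {
        public static void main(String[] args) {
            String s = "abc\uD800\uDF48def"; // mostly BMP text plus U+10348 (GOTHIC LETTER HWAIR)
            for (int i = 0; i < s.length(); ) {
                int cp = s.codePointAt(i);    // reads one code unit, or two for a surrogate pair
                System.out.printf("U+%04X%n", cp);
                i += Character.charCount(cp); // advance 1 for BMP, 2 beyond the BMP
            }
        }
    }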

ZZ Coder
With all widely-used processor architectures being LE (x86, x86-64, IA-64, ARM, etc.), using UTF-16BE would be masochism.
iconiK
Why is it "too expensive" to find characters?
luiscubal
+42  A: 
Pavel Radzivilovsky
I would like to add a little comment. Most Win32 "ASCII" functions receive locale strings in local encodings. For example, std::ifstream can accept a Hebrew file name if the locale encoding is a Hebrew one like 1255. All that is needed to support these encodings on Windows is for MS to add a UTF-8 code page to the system. This would make life much simpler. Then all the "ASCII" functions would be fully Unicode capable.
Artyom
FWIW the AfxMessageBox(MfcUtils::Convert(s), _T("Error"), MB_OK) example should probably really have been a call to a wrapper of that function that accepts std::string(s). Also, the Assert(false) in the functions toward the end should be replaced with static assertions.
Assaf Lavie
I can't agree. The advantages of UTF-16 over UTF-8 for many Asian languages completely dominate the points you make. It is naive to hope that the Japanese, Thai, Chinese, etc. are going to give up this encoding. The problematic clashes between charsets are when the charsets mostly seem similar, except with differences. I suggest standardising on: fixed 7-bit: iso-irv-170; 8-bit variable: UTF-8; 16-bit variable: UTF-16; 32-bit fixed: UCS-4.
Charles Stewart
@Charles: thanks for your input. True, some BMP characters are longer in UTF-8 than in UTF-16. But, let's face it: the problem is not in the bytes that BMP Chinese characters take, but in the software design complexity that arises. If a Chinese programmer has to design for variable-length characters anyway, it seems like UTF-8 is still a small price to pay compared to other variables in the system. He might use UTF-16 as a compression algorithm if space is so important, but even then it will be no match for LZ, and after LZ or other generic compression both take about the same size and entropy.
Pavel Radzivilovsky
What I am basically saying is that the simplification offered by having one encoding that is also compatible with existing char* programs, and is also the most popular today for everything, is unimaginable. It is almost like in the good old "plaintext" days. Want to open a file with a name? No need to care what kind of Unicode you are doing, etc. I suggest we, developers, confine UTF-16 to very special cases of severe optimization where a tiny bit of performance is worth man-months of work.
Pavel Radzivilovsky
Well, if I had to choose between UTF-8 and UTF-16, I would definitely stick to UTF-8, as it has no BOM, is ASCII-compatible and uses the same encoding scheme for every plane. But I have to admit that UTF-16 is simpler and more efficient for most BMP characters. There's nothing wrong with UTF-16 except the psychological aspects (mostly that "fixed size" isn't fixed size). Sure, one encoding would be better, but since both UTF-8 and UTF-16 are widely used, they each have their advantages.
Malcolm
@Malcolm: UTF-8, unfortunately, has a BOM too (0xEF 0xBB 0xBF). As silly as it looks (there is no byte order problem with a byte-oriented encoding), this is true, and it is there for a different reason: to signal that this is a UTF-8 stream. I have to disagree with you about BMP efficiency and UTF-16 popularity. It seems that the majority of UTF-16 software does not support it properly (e.g. all of the Win32 API - which I am a fan of), and this is inherent; the easiest way to fix these seems to be to switch them to another encoding. The efficiency argument is only true for a very narrow set of uses (I use Hebrew, and even there it is not).
Pavel Radzivilovsky
Well, what I meant is that you don't have to worry about byte order. UTF-8 can have a BOM indeed (it is actually the BOM character U+FEFF encoded in 3 bytes), though it's neither required nor recommended according to the standard. As for the APIs, I think the problem is that they were designed when surrogate pairs were either non-existent yet or not really adopted. And when something gets patched up, it's never as good as redesigning from scratch. The only (painful) way is to drop any backwards compatibility and redesign the APIs. Should they switch to UTF-8 in the process, I don't know.
Malcolm
@Malcolm, I think the natural way of this redesign is through changing the existing ANSI APIs. This way existing broken programs will unbreak (see my answer). This adds to the argument: UTF-16 must die.
Pavel Radzivilovsky
I'm sorry, I didn't really get the idea of why a transition to UTF-8 should be less painful. I also think that the inconsistency in C++ makes it worse. Say, Java is very specific about characters: char[] is no more than a char array, String is a string and Character is a character. Meanwhile, C++ is a mess with all the new stuff added to an existing language. To my mind, they should've abandoned any backwards compatibility and designed C++ in a way that doesn't allow mixing up structural programming and OOP, or Unicode and other encodings. Not that I want to start a holy war, that's merely my opinion.
Malcolm
UTF-8's disadvantage is NOT a small price to pay at all. Looking for any character is an O(n) operation, and other more complex operations can be far, far worse than with UTF-16. Also, UTF-8 is variable-length, just like UTF-16, so what's the point? UTF-8 was designed for storage and interoperability with ASCII. UTF-16 is the preferred way to store strings in memory, as anything outside the BMP is incredibly rare (you're writing in Klingon?). With a little trick, storing characters outside of the BMP in a hash or map, UTF-16 can have constant processing time.
iconiK
@iconiK: non-English BMP text is also quite rare. Consider all program sources and markup languages. One should have very good reasons to use UTF-16. See what is going on in the Linux world with regard to Unicode to measure the price of breaking changes.
Pavel Radzivilovsky
Linux has had a specific requirement when choosing to use UTF-8 internally: compatibility with Unix. Windows didn't need that, and thus when the developers implemented Unicode, they added UCS-2 versions of almost all functions handling text and made the multibyte ones simply convert to UCS-2 and call the other ones. They later replaced UCS-2 with UTF-16. Linux on the other hand kept to 8-bit encodings and thus used UTF-8, as it's the proper choice in that case.
iconiK
You may wish to read my answer again. Windows does not support UTF-16 properly to date. Also, the reason for choosing UCS-2 was different. Again, see my answer. For Linux, I believe the main reason was compatibility not with Unix but with existing code - for instance, if your ANSI app copies files, getting names from command arguments and calling system APIs, it will remain completely intact with UTF-8. Isn't that wonderful?
Pavel Radzivilovsky
@Pavel: The bug you linked to (Michael Kaplan's blog entry) has long been resolved by now. Michael said in the post already that it's fixed in Vista, and I can't reproduce it on Windows 7 either. While this doesn't fix legacy systems running on XP, saying that »there is still no proper support« is plain wrong.
Joey
@Johannes: [1] many thanks for the info. [2] IMO a programmer, today, should be able to write programs that support Windows XP. It is still a popular one, and I don't know of a Windows update that fixes it.
Pavel Radzivilovsky
Well, the program works just fine; it just has a little trouble dealing with astral planes, but that's an OS issue, not one with your program. It's like asking that current versions of Uniscribe be backported to old OSes so that people on XP can enjoy a few scripts that would render improperly before. It's not something MS does. Besides, XP is almost a decade old by now and supporting it becomes a major burden in some cases (see for example the reasoning why Paint.NET will require Vista with its 4.0 release). Mainstream support for that OS has already ended, too; only security bugs are fixed now.
Joey
Still not convincing to use UTF-16 for in-memory representation of strings on Windows. :) I wish the Windows 7 guys would extend their support of the already existing #define of CP_UTF8 instead...
Pavel Radzivilovsky
@Pavel Radzivilovsky: I fail to see how your code, using UTF-8 everywhere, will protect you from bugs in the Windows API? I mean: You're copying/converting strings for all calls to the WinAPI that use them, and still, if there is a bug in the GUI, or the filesystem, or whatever system handled by the OS, the bug remains. Now, perhaps your code has a specific UTF-8 handling functions (search for substrings, etc.), but then, you could have written them to handle UTF-16 instead, and avoid all this bloated code (unless you're writing cross-platform code... There, UTF-8 could be a sensible choice)
paercebal
@Pavel Radzivilovsky: BTW, your writings about *"I believe that all other encodings will die eventually. This involves that MS-Windows, Java, ICU, python stop using it as their favorite."* and *"In particular, I think adding wchar_t to C++ was a mistake, and so are the unicode additions to C++Ox."* are either quite naive or very very arrogant. And this is coming from someone coding at home with a Linux and who is happy with the UTF-8 chars. To put it bluntly: **It won't happen**.
paercebal
@paercebal: If the majority of the code is API calls, the code is very simple. Typically, the majority of code dealing with strings is in libraries that treat them as cookies, and that is what they are optimized for. Hence, the bloating argument fails. As for the 'favorite UTF-16' for ICU and Python, this is very questionable: these tools use UTF-16 internally, and changing it as a part of their evolution is the easiest thing. It can happen on any major release, because it doesn't break the interfaces.
Pavel Radzivilovsky
In ICU we already see more and more UTF-8 interfaces and optimizations. However, UTF-16 works perfectly well, and makes complicated lookup efficient, more than with UTF-8. We will not see ICU drop UTF-16 internally. UTF-16 in memory, UTF-8 on the wire and on disk. All is good.
Steven R. Loomis
@Steven, it looks like differentiating between wire and RAM is not as small a thing as it may seem. BTW, comparison is cheaper with UTF-8. I agree that ICU is certainly a major player in this market, and there's no need to "drop" support of anything. The simplification of application design and testing with UTF-8 is exactly what will, in my humble opinion, drive UTF-16 to extinction, and the sooner the better.
Pavel Radzivilovsky
@Pavel Radzivilovsky I meant, drop UTF-16 as the internal processing format. Can you expand on 'not a small thing'? And, anyways, UTF-16/UTF-8/UTF-32 have a 1:1:1 mapping. I'm much more interested in seeing non-Unicode encodings die. As far as UTF-8 goes for simplification, you say "they can just pass strings as char*"- right, and then they assume that the char* is some ASCII-based 8-bit encoding. Plenty of errors creep in when toupper(), etc, is used on UTF-8. It's not wonderful, but it is helpful.
Steven R. Loomis
@Steve First and foremost I agree about non-Unicode. There's no argument about that. Practically, it already happened; they are already dead, in this exact sense: any non-Unicode operation on a string is considered a bug just like any other software bug, or a 'text crime' in my company's slang. It is true that char* is misleading many into Unicode bugs as well. Good luck with toupper() on a UTF-8 string, or, say, with assuming that ICU's toupper does not change the number of characters (as with the German eszett converting to SS). After the standard has been established, there's no more reason for such bugs.
Pavel Radzivilovsky
@Steve, 2; and then we come to a more subtle thing, which is everything around human engineering and safety and designing a proper way of working, for the developer to do less and for the machine to do more. This is exactly where UTF-16 doesn't fit. Most applications do not reverse or even sort strings. Most often strings are treated as cookies, such as a file name here and there, concatenated here and there, embedded programming languages such as SQL, and other really simple transformations. In this world, there's very little reason to have a different format in RAM than on the wire.
Pavel Radzivilovsky
A: 

UTF-16? Definitely harmful. Just my grain of salt here, but there are exactly three acceptable encodings for text in a program:

  • ASCII: when dealing with low-level things (e.g. microcontrollers) that can't afford anything better
  • UTF-8: storage in fixed-width media such as files
  • integer codepoints ("CP"?): an array of the largest integers that are convenient for your programming language and platform (decays to ASCII in the limit of low resources). Should be int32 on older computers and int64 on anything with 64-bit addressing. (See the sketch after this list.)

  • Obviously, interfaces to legacy code use whatever encoding is needed to make the old code work right.
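
For illustration (a minimal Java sketch, not part of the original answer), converting a string to and from an array of integer code points:

    public class CodePointArray {
        public static void main(String[] args) {
            String s = "a\uD834\uDD1Eb"; // "a" + U+1D11E + "b": 4 UTF-16 code units

            // Decode to an array of code points: one int per character.
            int[] cps = s.codePoints().toArray();
            System.out.println(cps.length);        // 3
            System.out.printf("U+%04X%n", cps[1]); // U+1D11E

            // Indexing is now trivial; re-encode when needed.
            String back = new String(cps, 0, cps.length);
            System.out.println(back.equals(s));    // true
        }
    }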

David X
Unicode guarantees there will be no codepoints above `U+10FFFF`. You are talking about UTF-32/UCS-4 (they are identical). If you are thinking about speed, 32->64 is not like 16->32; int64 is not faster on 64-bit processors.
Simon Buchan
@simon buchan, the `U+10ffff` max will go out the window when (not if) they run out of codepoints. That said, using int32 on a p64 system for speed is probably safe, since I doubt they'll exceed `U+ffffffff` before you're forced to rewrite your code for 128-bit systems around 2050. (That is the point of "use the largest int that is convenient" as opposed to "largest available" (which would probably be int256 or bignums or something).)
David X
@David: Unicode 5.2 encodes 107,361 codepoints. There are 867,169 unused codepoints. "when" is just silly. A Unicode codepoint is *defined* as a number from 0 to 0x10FFFF, a property which UTF-16 depends upon. (Also, 2050 seems much too low an estimate for 128-bit systems when a 64-bit system can hold the entirety of the Internet in its address space.)
Simon Buchan
@Simon, yes, I was thinking 2050 sounded a bit low for either ETA, my point was that yes, "when" is silly, but it *will* happen. My point in the original answer, however, was to use an array of ints of whatever size is needed for the largest codepoint you expect to handle. (And yes, I did forget that most p64 systems still use int32 as a primary integer type. I'm not sure why.)
David X
@David: Your "when" was referring to running out of Unicode codepoints, not a 128-bit switch which, yes, will be in the next few centuries. Unlike memory, there is no exponential growth of characters, so the Unicode Consortium has *specifically* guaranteed they will *never* allocate a codepoint above `U+10FFFF`. This really is one of those situations when 21 bits *is* enough for anybody.
Simon Buchan
@Simon Buchan: At least until first contact. :)
dalle
+18  A: 

Unicode codepoints are not characters! Sometimes they are not even glyphs (visual forms).

Some examples:

  • Roman numeral codepoints like "ⅲ". (A single character that looks like "iii".)
  • Accented characters like "á", which can be represented as either a single combined character "\u00e1" or a character and a separate diacritic "\u0061\u0301".
  • Characters like Greek lowercase sigma, which have different forms for middle ("σ") and end ("ς") of word positions, but which should be considered synonyms for search.
  • Unicode discretionary hyphen U+00AD, which might or might not be visually displayed, depending on context, and which is ignored for semantic search.

The only ways to get Unicode editing right are to use a library written by an expert, or to become an expert and write one yourself. If you are just counting codepoints, you are living in a state of sin.
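
To illustrate the accented-character case (a minimal Java sketch using java.text.Normalizer; not part of the original answer):

    import java.text.Normalizer;

    public class CombiningDemo {
        public static void main(String[] args) {
            String composed = "\u00E1";    // "á" as one code point
            String decomposed = "a\u0301"; // "a" + COMBINING ACUTE ACCENT

            // Different code point sequences, same visible character.
            System.out.println(composed.equals(decomposed));                       // false
            System.out.println(composed.codePointCount(0, composed.length()));     // 1
            System.out.println(decomposed.codePointCount(0, decomposed.length())); // 2

            // Normalizing both to NFC (or NFD) makes them comparable.
            String nfc = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
            System.out.println(nfc.equals(composed));                              // true
        }
    }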

Daniel Newby
A: 

UTF-8 is definitely the way to go, possibly accompanied by UTF-32 for internal use in algorithms that need high-performance random access (but that ignores combining characters).

Both UTF-16 and UTF-32 (as well as their LE/BE variants) suffer from endianness issues, so they should never be used externally.

Tronic
Constant-time random access is possible with UTF-8 too, just use code units rather than code points. Maybe you need real random code point access, but I've never seen a use case, and you're just as likely to want random grapheme cluster access instead.
Rhamphoryncus
+2  A: 

I wouldn't necessarily say that UTF-16 is harmful. It's not elegant, but it serves its purpose of backwards compatibility with UCS-2, just like GB18030 does with GB2312, and UTF-8 does with ASCII.

But making a fundamental change to the structure of Unicode in midstream, after Microsoft and Sun had built huge APIs around 16-bit characters, was harmful. The failure to spread awareness of the change was more harmful.

dan04
UTF-8 is a superset of ASCII, but UTF-16 is NOT a superset of UCS-2. Although it is almost a superset, a correct encoding of UCS-2 into UTF-8 results in the abomination known as CESU-8; UCS-2 doesn't have surrogates, just ordinary code points, so they must be translated as such. The real advantage of UTF-16 is that it's easier to upgrade a UCS-2 codebase than to do a complete rewrite for UTF-8. Funny, huh?
Rhamphoryncus
Sure, technically UTF-16 isn't a superset of UCS-2, but when were U+D800 to U+DFFF ever *used* for anything except UTF-16 surrogates?
dan04
Doesn't matter. Any processing other than blindly passing through the bytestream requires you to decode the surrogate pairs, which you can't do if you're treating it as UCS-2.
Rhamphoryncus
A: 

Does anyone else consider this déjà vu from when DBCS had the same problems? What about UTF-8 programs that don't really handle 4-byte characters properly? That is why Windows does not support it as the ANSI codepage. One last thing: what version of Windows did you try this on? I just tried this myself on Chinese Windows 2000 (the first version of Windows that claims to support UTF-16) and the standard edit control does handle it correctly.

Yuhong Bao
This happens on Windows XP. Also, you may have accidentally copied a character that is inside the BMP. Believe me, it happens - a lot. Now, I have never found any UTF-8-enabled software that wasn't able to deal with 4-byte characters, because if you already deal with variable length (and this means you are using anything non-ASCII) then generally you'll do it right, as you respect variable length. This does not happen in the case of UTF-16, as 95% of all programmers are sure that UTF-16 is a fixed-length encoding, and even if they know it isn't, they almost never check the application with text outside the BMP, as it is quite rare.
Artyom
+1  A: 

Someone said UCS-4 and UTF-32 were the same. Not so, but I know what you mean; one of them is an encoding of the other, though. I wish they'd thought to specify endianness from the start so we wouldn't have the endianness battle fought out here too. Couldn't they have seen that coming? At least UTF-8 is the same everywhere (unless someone is following the original spec with 6-byte sequences). Sigh.

If you use UTF-16 you HAVE to include handling for multibyte characters. You can't go to the Nth character by indexing 2N into a byte array. You have to walk it, or have character indices. Otherwise you've written a bug.

The current draft spec of C++ says that UTF-32 and UTF-16 can have little-endian, big-endian, and unspecified variants. Really? If Unicode had specified that everyone had to do little-endian from the beginning then it would have all been simpler. (I would have been fine with big-endian as well.) Instead, some people implemented it one way, some the other, and now we're stuck with silliness for nothing. Sometimes it's embarrassing to be a software engineer.
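
A minimal Java sketch of that "walk it" requirement (illustrative only, not from this answer): reaching the Nth character means counting code points rather than doubling an index:

    public class NthCharacter {
        public static void main(String[] args) {
            String s = "\uD834\uDD1Exyz"; // U+1D11E followed by "xyz"
            int n = 1;                    // we want the second character (0-based)

            // Indexing 2*n bytes (or n code units) assumes every character is one 16-bit unit.
            // Instead, walk the string, skipping two units for each surrogate pair.
            int index = s.offsetByCodePoints(0, n);
            System.out.println(index);                           // 2, not 1
            System.out.printf("U+%04X%n", s.codePointAt(index)); // U+0078 ('x')
        }
    }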

Patrick Horgan
Unspecified endianness is supposed to include a BOM as the first character, used for determining which way the string should be read. UCS-4 and UTF-32 indeed are the same nowadays, i.e. a numeric UCS value between 0 and 0x10FFFF stored in a 32-bit integer.
Tronic