views:

93

answers:

2

As far as I know, Linux chose UTF-8 for its backward compatibility with ASCII, whereas Windows added a completely new set of API functions for UTF-16 (the ones ending with "W"). Could these decisions have been different? Which one proved better?

+4  A: 

Windows chose to support Unicode with UTF-16 and the attendant ANSI/Unicode ("A"/"W") function pairs way, way, WAAAAAAY back in the early '90s (Windows NT 3.1 came out in 1993), before Linux ever had any notion of Unicode support.

Linux has been able to learn from the best practices built up on Windows and other Unicode-capable platforms.

Many people would agree today that UTF-8 is the better encoding for size reasons, unless you know you're going to be dealing almost exclusively with lots of "double-byte" (mostly CJK) characters, where UTF-16 is the more space-efficient encoding.
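A minimal Java sketch of that size difference (the sample strings are arbitrary):

    import java.nio.charset.StandardCharsets;

    public class EncodingSizes {
        public static void main(String[] args) {
            String ascii = "hello world";    // ASCII-only text
            String cjk   = "こんにちは世界";   // Japanese text, all within the BMP

            // ASCII: 1 byte per character in UTF-8, 2 in UTF-16
            System.out.println(ascii.getBytes(StandardCharsets.UTF_8).length);    // 11
            System.out.println(ascii.getBytes(StandardCharsets.UTF_16LE).length); // 22

            // CJK (BMP): 3 bytes per character in UTF-8, 2 in UTF-16
            System.out.println(cjk.getBytes(StandardCharsets.UTF_8).length);      // 21
            System.out.println(cjk.getBytes(StandardCharsets.UTF_16LE).length);   // 14
        }
    }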

Chris Kaminski
So why does Java use UTF-16? It's a much later design.
Michal Czardybon
Probably for better integration with Windows. A typical "good" reason (or at least a seemingly valid one) for many bad design decisions.
ypnos
I think Windows compatibility has a lot to do with it. (Python also uses UTF-16 for Unicode strings when running on Windows.) However, there's also a simple element of bad design with Unicode in Java: there are a lot of places where Java defers to a system "default encoding", which is almost never the right thing. Java encourages you to write brittle, non-portable charset code.
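A minimal sketch of that "default encoding" trap (the sample string is arbitrary):

    import java.nio.charset.StandardCharsets;

    public class DefaultCharsetPitfall {
        public static void main(String[] args) {
            byte[] utf8Bytes = "naïve".getBytes(StandardCharsets.UTF_8);

            // Brittle: decodes with the platform default encoding, so the result
            // depends on the locale of the machine it happens to run on.
            String maybeWrong = new String(utf8Bytes);

            // Portable: the charset is named explicitly.
            String alwaysRight = new String(utf8Bytes, StandardCharsets.UTF_8);

            System.out.println(maybeWrong + " / " + alwaysRight);
        }
    }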
bobince
The reason MS went with UTF-16 is that at the time it meant a fixed-size character. Then more codepoints were added to Unicode and it turned out 16 bits wasn't enough for all of them, so now Windows is stuck with the worst of both worlds: a variable-length character encoding, wasted space, and loss of compatibility with ASCII strings. This means Windows ends up converting UTF-8 -> UTF-16 and vice versa a lot... (Java does this too, FOR ALL STRINGS... ugh, slow; at least on Windows you can defer the conversion until you absolutely have to.)
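A sketch of that conversion happening at every I/O boundary (assuming the data arrives as UTF-8; Java strings are logically UTF-16 internally):

    import java.nio.charset.StandardCharsets;

    public class TranscodingCost {
        public static void main(String[] args) {
            // Pretend these bytes arrived from a file or socket as UTF-8.
            byte[] incomingUtf8 = "Grüße, world".getBytes(StandardCharsets.UTF_8);

            // Decoding transcodes every byte into the String's UTF-16 code units...
            String text = new String(incomingUtf8, StandardCharsets.UTF_8);

            // ...and writing it back out transcodes the whole thing again.
            byte[] outgoingUtf8 = text.getBytes(StandardCharsets.UTF_8);

            System.out.println(outgoingUtf8.length + " bytes round-tripped");
        }
    }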
Spudd86
Also, the Linux kernel didn't really need to do much of anything to support UTF-8 (Linux file names can contain any byte except NUL and '/'), and other than file names the kernel doesn't really deal with text that could be anything other than ASCII. (Of course this gets tricky when you want to use a filesystem like NTFS, where you DO have to care about the encoding... since it uses UTF-16... and Linux does not...)
Spudd86
@Michal Czardybon: Actually, Java was fairly contemporary with Windows NT. I started using it in late 1996 - I would argue that a lot of the thinking at the time was still that UTF-16 was the way to go. If you figure development of the language started around 1994, it sort of makes sense. @Spudd86: they knew when it was created that 16 bits wasn't big enough for every character ever written - the argument was that it would be enough for the vast majority of characters currently in use. Yeah, EPIC fail. :-/
Chris Kaminski
+6  A: 

UTF-16 is pretty much a loss, the worst of both worlds. It is neither compact (for the typical case of ASCII characters), nor does it map each code unit to a character. This hasn't really bitten anyone too badly yet, since characters outside the Basic Multilingual Plane are still rarely used, but it sure is ugly.
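A minimal Java sketch of the code-unit/character mismatch, using an arbitrary character outside the BMP:

    public class CodeUnitsVsCharacters {
        public static void main(String[] args) {
            // "a" followed by U+1F600, which needs a UTF-16 surrogate pair
            String s = "a\uD83D\uDE00";

            System.out.println(s.length());                      // 3 UTF-16 code units
            System.out.println(s.codePointCount(0, s.length())); // 2 actual characters
        }
    }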

POSIX (Linux et al.) has some wide-character ("w") APIs too, based on the wchar_t type. On platforms other than Windows this typically corresponds to UTF-32 rather than UTF-16, which is nice for easy string manipulation but incredibly bloated.
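Java has no wchar_t, but the same UTF-32 tradeoff can be sketched with codePoints() (assumes Java 8+): one 32-bit int per character makes indexing trivial, at four bytes each.

    public class Utf32View {
        public static void main(String[] args) {
            String s = "a\uD83D\uDE00b";  // three characters, one outside the BMP

            // One int per code point, like a UTF-32/wchar_t buffer on POSIX
            int[] codePoints = s.codePoints().toArray();

            System.out.println(codePoints.length);                        // 3
            System.out.println(codePoints.length * 4 + " bytes as UTF-32");
        }
    }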

But in-memory APIs aren't really that important. What causes much more difficulty is file storage and on-the-wire protocols, where data is exchanged between applications with different charset traditions.

Here, compactness beats ease of indexing; UTF-8 has clearly proven to be by far the best format for this, and Windows's poor support for UTF-8 causes real difficulties. Windows is the last modern operating system to still have locale-specific default encodings; everyone else has moved to UTF-8 by default.
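A sketch of naming the encoding explicitly at the storage boundary (assumes Java 11+ for the Files convenience methods; the file name is hypothetical):

    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class Utf8FileIo {
        public static void main(String[] args) throws Exception {
            Path file = Path.of("notes.txt");  // hypothetical file name

            // Stating UTF-8 at the boundary avoids the locale-specific default.
            Files.writeString(file, "Grüße, 世界", StandardCharsets.UTF_8);
            String back = Files.readString(file, StandardCharsets.UTF_8);

            System.out.println(back);
        }
    }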

Whilst I seriously hope Microsoft will reconsider this for future versions, as it causes huge and unnecessary problems even within the Windows-only world, it's understandable how it happened.

The thinking back in the old days when WinNT was being designed was that UCS-2 was it for Unicode. There wasn't going to be anything outside the 16-bit character range. Everyone would use UCS-2 in-memory and naturally it would be easiest to save this content directly from memory. This is why Windows called that format “Unicode”, and still to this day calls UTF-16LE just “Unicode” in UI like save boxes, despite it being totally misleading.

UTF-8 wasn't even standardised until Unicode 2.0 (along with the extended character range and the surrogates that made UTF-16 what it is today). By then Microsoft were on to WinNT4, at which point it was far too late to change strategy. In short, Microsoft were unlucky to be designing a new OS from scratch around the time when Unicode was in its infancy.

bobince
well Japan and China use the stuff outside the BMP pretty frequently...
Spudd86
I wouldn't say frequently. The BMP contains the ideographs that were in the previously standard national character sets (Shift-JIS, Big5, GB, etc.) at the time, so the additional characters in the supplementary planes are ones that CJK users couldn't use at all before Unicode 3.1. These are mostly historical characters of academic interest; IMEs don't let you type them directly, and font support is still very weak.
bobince