views:

93

answers:

2

As far as I know, Linux chose UTF-8 for its backward compatibility with ASCII, whereas Windows added a completely new set of API functions for UTF-16 (the ones ending with "W"). Could these decisions have been different? Which one proved better?

+4  A: 

Windows chose to support Unicode with UTF-16 and the attendant ANSI/Unicode ("A"/"W") function pairs way, way, WAAAAAAY back in the early '90s (Windows NT 3.1 came out in 1993), before Linux ever had any notion of Unicode support.

Linux has been able to learn from the best practices built up on Windows and other Unicode-capable platforms.

Many people would agree today that UTF-8 is the better encoding for size reasons, unless you know you're going to be dealing almost exclusively with lots of "double-byte" (mostly CJK) characters, where UTF-16 is the more space-efficient encoding.
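A minimal Java sketch of that size difference (the sample strings are arbitrary):

    import java.nio.charset.StandardCharsets;

    public class EncodingSizes {
        public static void main(String[] args) {
            String ascii = "hello world";    // ASCII-only text
            String cjk   = "こんにちは世界";   // Japanese text, all within the BMP

            // ASCII: 1 byte per character in UTF-8, 2 in UTF-16
            System.out.println(ascii.getBytes(StandardCharsets.UTF_8).length);    // 11
            System.out.println(ascii.getBytes(StandardCharsets.UTF_16LE).length); // 22

            // CJK (BMP): 3 bytes per character in UTF-8, 2 in UTF-16
            System.out.println(cjk.getBytes(StandardCharsets.UTF_8).length);      // 21
            System.out.println(cjk.getBytes(StandardCharsets.UTF_16LE).length);   // 14
        }
    }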

Chris Kaminski
So why does Java use UTF-16? It's a much later design.
Michal Czardybon
Probably for better integration with Windows. A typical "good" reason (or at least a seemingly valid one) for many bad design decisions.
ypnos
I think Windows compatibility has a lot to do with it. (Python also uses UTF-16 for Unicode strings when running on Windows.) However, there's also a simple element of bad design with Unicode in Java: there are a lot of places where Java defers to a system "default encoding", which is almost never the right thing. Java encourages you to write brittle, non-portable charset code.
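A minimal sketch of that "default encoding" trap (the sample string is arbitrary):

    import java.nio.charset.StandardCharsets;

    public class DefaultCharsetPitfall {
        public static void main(String[] args) {
            byte[] utf8Bytes = "naïve".getBytes(StandardCharsets.UTF_8);

            // Brittle: decodes with the platform default encoding, so the result
            // depends on the locale of the machine it happens to run on.
            String maybeWrong = new String(utf8Bytes);

            // Portable: the charset is named explicitly.
            String alwaysRight = new String(utf8Bytes, StandardCharsets.UTF_8);

            System.out.println(maybeWrong + " / " + alwaysRight);
        }
    }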
bobince
The reason MS went with UTF-16 is that at the time it meant a fixed-size character. Then more codepoints were added to Unicode and it turned out 16 bits wasn't enough for all of them, so now Windows is stuck with the worst of both worlds: a variable-length character encoding, wasted space, and loss of compatibility with ASCII strings. This means Windows ends up converting UTF-8 -> UTF-16 and vice versa a lot... (Java does this too, FOR ALL STRINGS... ugh, slow; at least on Windows you can defer the conversion until you absolutely have to.)
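A sketch of that conversion happening at every I/O boundary (assuming the data arrives as UTF-8; Java strings are logically UTF-16 internally):

    import java.nio.charset.StandardCharsets;

    public class TranscodingCost {
        public static void main(String[] args) {
            // Pretend these bytes arrived from a file or socket as UTF-8.
            byte[] incomingUtf8 = "Grüße, world".getBytes(StandardCharsets.UTF_8);

            // Decoding transcodes every byte into the String's UTF-16 code units...
            String text = new String(incomingUtf8, StandardCharsets.UTF_8);

            // ...and writing it back out transcodes the whole thing again.
            byte[] outgoingUtf8 = text.getBytes(StandardCharsets.UTF_8);

            System.out.println(outgoingUtf8.length + " bytes round-tripped");
        }
    }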
Spudd86
Also, the Linux kernel didn't really need to do much of anything to support UTF-8 (Linux file names can contain any byte except NUL and '/'), and other than file names the kernel doesn't really deal with text that could be anything other than ASCII. (Of course this gets tricky when you want to use a filesystem like NTFS, where you DO have to care about the encoding... since it uses UTF-16... and Linux does not...)
Spudd86
@Michal Czardybon: Actually, Java was fairly contemporary with Windows NT. I started using it in late 1996 - I would argue that a lot of the thinking at the time was still that UTF-16 was the way to go. If you figure development of the language started around 1994, it sort of makes sense. @Spudd86: they knew when it was created that 16 bits wasn't big enough for every character ever written - the argument was that it would be enough for the vast majority of characters currently in use. Yeah, EPIC fail. :-/
Chris Kaminski
+6  A: 

UTF-16 is pretty much a loss, the worst of both worlds. It is neither compact (for the typical case of ASCII characters), nor does it map each code unit to a character. This hasn't really bitten anyone too badly yet, since characters outside the Basic Multilingual Plane are still rarely used, but it sure is ugly.
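A minimal Java sketch of the code-unit/character mismatch, using an arbitrary character outside the BMP:

    public class CodeUnitsVsCharacters {
        public static void main(String[] args) {
            // "a" followed by U+1F600, which needs a UTF-16 surrogate pair
            String s = "a\uD83D\uDE00";

            System.out.println(s.length());                      // 3 UTF-16 code units
            System.out.println(s.codePointCount(0, s.length())); // 2 actual characters
        }
    }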

POSIX (Linux et al.) has some wide-character ("w") APIs too, based on the wchar_t type. On platforms other than Windows this typically corresponds to UTF-32 rather than UTF-16, which is nice for easy string manipulation but incredibly bloated.
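Java has no wchar_t, but the same UTF-32 tradeoff can be sketched with codePoints() (assumes Java 8+): one 32-bit int per character makes indexing trivial, at four bytes each.

    public class Utf32View {
        public static void main(String[] args) {
            String s = "a\uD83D\uDE00b";  // three characters, one outside the BMP

            // One int per code point, like a UTF-32/wchar_t buffer on POSIX
            int[] codePoints = s.codePoints().toArray();

            System.out.println(codePoints.length);                        // 3
            System.out.println(codePoints.length * 4 + " bytes as UTF-32");
        }
    }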

But in-memory APIs aren't really that important. What causes much more difficulty is file storage and on-the-wire protocols, where data is exchanged between applications with different charset traditions.

Here, compactness beats ease of indexing; UTF-8 has clearly proven to be by far the best format for this, and Windows's poor support for UTF-8 causes real difficulties. Windows is the last modern operating system to still have locale-specific default encodings; everyone else has moved to UTF-8 by default.
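A sketch of naming the encoding explicitly at the storage boundary (assumes Java 11+ for the Files convenience methods; the file name is hypothetical):

    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class Utf8FileIo {
        public static void main(String[] args) throws Exception {
            Path file = Path.of("notes.txt");  // hypothetical file name

            // Stating UTF-8 at the boundary avoids the locale-specific default.
            Files.writeString(file, "Grüße, 世界", StandardCharsets.UTF_8);
            String back = Files.readString(file, StandardCharsets.UTF_8);

            System.out.println(back);
        }
    }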

Whilst I seriously hope Microsoft will reconsider this for future versions, as it causes huge and unnecessary problems even within the Windows-only world, it's understandable how it happened.

The thinking back in the old days when WinNT was being designed was that UCS-2 was it for Unicode. There wasn't going to be anything outside the 16-bit character range. Everyone would use UCS-2 in-memory and naturally it would be easiest to save this content directly from memory. This is why Windows called that format “Unicode”, and still to this day calls UTF-16LE just “Unicode” in UI like save boxes, despite it being totally misleading.

UTF-8 wasn't even standardised until Unicode 2.0 (along with the extended character range and the surrogates that made UTF-16 what it is today). By then Microsoft were on to WinNT4, at which point it was far too late to change strategy. In short, Microsoft were unlucky to be designing a new OS from scratch around the time when Unicode was in its infancy.

bobince
well Japan and China use the stuff outside the BMP pretty frequently...
Spudd86
I wouldn't say frequently. The BMP contains the ideographs that were in the previously standard national character sets (Shift-JIS, Big5, GB, etc.) at the time, so the additional characters in the supplementary planes are ones that CJK users couldn't use at all before Unicode 3.1. These are mostly historical characters of academic interest; IMEs don't let you type them directly, and font support is still very weak.
bobince