Why isn't everything we do in Unicode?

+28 Q:

Given that Unicode has been around for 18 years, why are there still apps that don't have Unicode support? Even my experiences with some operating systems and Unicode have been painful to say the least. As Joel Spolsky pointed out in 2003, it's not that hard. So what's the deal? Why can't we get it together?

+5 A:

Laziness, ignorance.

Matt Olenik 2009-06-11 03:46:06

+1 A:

More overhead, space requirements.

Cuga 2009-06-11 03:46:25

That's an interesting point. Where I work, we often encode strings using UTF-8 to "compress" them. Of course, this only works because we speak English. ;)

Peter Ruderman 2009-06-11 03:59:52

Really? That seems like a poorly thought-out reason to me. Unless you are running on an embedded system, space requirements and overhead should be a lesser consideration than internationalization.

docgnome 2009-06-11 04:02:32

This is by no means universal. We use it for our human readable network protocols, not for general strings.

Peter Ruderman 2009-06-11 04:10:48

@docgnome: There are projects where you know you'll *never* have to account for internationalization. For these, it makes sense to NOT use Unicode.

Cuga 2009-06-11 15:00:12

@Cuga: Internationalization is but one reason to use Unicode. As soon as a program deals with strings (for person names, addresses, file or directory names, whatever), it no longer really makes sense to NOT use Unicode. See the anecdote in the last paragraph of the accepted answer.

mghie 2009-06-16 11:05:24

+10 A:

Probably because people are used to ASCII and a lot of programming is done by native English speakers.

IMO, it's a function of collective habit, rather than conscious choice.

Eric 2009-06-11 03:47:59

Even programming is done in English by non-native speakers - and the languages themselves are in English (that is, "do ... while" and "until" and "do ... end" and so on). Even the programmers program in English. So I would venture that it's no surprise that even non-English-speaking programmers may not utilize Unicode that much.

David 2009-06-13 01:12:34

I suspect it's because software has such strong roots in the west. UTF-8 is a nice, compact format if you happen to live in America. But it's not so hot if you live in Asia. ;)

Peter Ruderman 2009-06-11 03:49:42

I think you mean "ASCII is [nice in] America". UTF-8 can represent every Unicode character, but is backwards-compatible with ASCII. At least the people behind XML made UTF-8/16 the default encoding. I wish more people could get their encoding right for web pages, though...

Quinn Taylor 2009-06-11 04:42:48

Actually, UTF-8 is not always worse than UTF-16. UTF-16 uses either 2 or 4 bytes per character; UTF-8 uses 1-4 bytes per character. That said, the characters from U+0800 to U+FFFF require 2 bytes in UTF-16 but 3 bytes in UTF-8, and those code points cover a number of Asian languages.
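
These byte counts are easy to check by encoding a few sample characters. The following is a minimal Python sketch using only the standard library; the sample characters and the helper name encoded_sizes are arbitrary choices for illustration, not something from the comment above.

```python
# Compare the encoded size of sample characters in UTF-8 and UTF-16.
# The samples are arbitrary picks, one per range discussed above.
samples = {
    "A": "U+0041, ASCII range",
    "é": "U+00E9, below U+0800",
    "中": "U+4E2D, between U+0800 and U+FFFF",
    "😀": "U+1F600, outside the Basic Multilingual Plane",
}


def encoded_sizes(ch):
    """Return (utf8_bytes, utf16_bytes) for a single character."""
    # "utf-16-le" is used because plain "utf-16" prepends a 2-byte BOM.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


for ch, note in samples.items():
    utf8, utf16 = encoded_sizes(ch)
    print(f"{ch!r} ({note}): UTF-8 = {utf8} bytes, UTF-16 = {utf16} bytes")
```

For text that is mostly ASCII, UTF-8 comes out at half the size of UTF-16; for text dominated by the U+0800 to U+FFFF range, UTF-16 is the smaller of the two.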

Jonathan Leffler 2009-06-11 04:50:00

No, I meant UTF-8, precisely because it is backward compatible. Any text file encoded in ASCII is also encoded in UTF-8.
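
The compatibility claim can be illustrated in a few lines of Python. This is only a sketch; the sample strings and the helper is_ascii are made up for the example.

```python
# Bytes in the ASCII range (0x00-0x7F) decode identically as ASCII and UTF-8.
ascii_bytes = "plain old text".encode("ascii")
as_ascii = ascii_bytes.decode("ascii")
as_utf8 = ascii_bytes.decode("utf-8")
print(as_ascii == as_utf8)


def is_ascii(data):
    """Return True if the byte string is valid ASCII."""
    try:
        data.decode("ascii")
        return True
    except UnicodeDecodeError:
        return False


# The reverse does not hold: UTF-8 text with non-ASCII characters is not ASCII.
print(is_ascii("Köln".encode("utf-8")))
```

So every ASCII file is a valid UTF-8 file, but a UTF-8 file is only valid ASCII as long as it sticks to the first 128 code points.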

Peter Ruderman 2009-06-11 12:41:54

+17 A:

Many product developers don't consider their apps being used in Asia or other regions where Unicode is a requirement.
Converting existing apps to Unicode is expensive and usually driven by sales opportunities.
Many companies have products maintained on legacy systems and migrating to Unicode means a totally new development platform.
You'd be surprised how many developers don't understand the full implications of Unicode in a multi-language environment. It's not just a case of using wide strings.

Bottom line - cost.

Gerard 2009-06-11 03:54:53

This sounds like a more reasonable answer to my question than most of the others. Exchange sales opportunities with whatever metrics open source apps use and I think this, combined with what Eric said, is probably the reason.

docgnome 2009-06-11 04:10:07

Unicode requires more work (thinking). You usually only get paid for what is required, so you go with the fastest, least complicated option.

Well, that's from my point of view. If you expect code to use std::wstring hw(L"hello world"), you have to explain how it all works: to print a wstring you need wcout, as in std::wcout << hw << std::endl; (I think; endl seems fine as is). So it seems like more work to me. Of course, if I were writing an international app I would have to invest in figuring it out, but until then I don't (as, I suspect, most developers don't).

I guess this goes back to money, time is money.

stefanB 2009-06-11 03:55:47

That assumes that you get paid. What about open source apps? What about languages? Why aren't strings in all languages in unicode by default? My point is that it _shouldn't_ be more work.

docgnome 2009-06-11 03:58:13

+1 A:

It's simple. Because we only have ASCII characters on our keyboards, why would we ever encounter, or care about characters other than those? It's not so much an attitude as it is what happens when a programmer has never had to think about this issue, or never encountered it, perhaps doesn't even know what unicode is.

edit: Put another way, Unicode is something you have to think about, and thinking is not something most people are interested in doing, even programmers.

Breton 2009-06-11 04:06:02

I would think that we would care about those other characters because other people care about those characters. Should we not build UIs that any idiot can use simply because we can use something less polished and more complex?

docgnome 2009-06-11 04:22:56

I'm sorry, on my keyboard there are definitely characters that are not ASCII. Umlaut chars? Chinese or cyrillic keyboards? I'd assume that your assertion is wrong for most of the PCs around the world...

mghie 2009-06-11 04:27:42

My assertion is true for the grand majority of *American* keyboards, and computer users. I don't know if you noticed, but that's kind of where most of the world's software comes from.

Breton 2009-06-11 04:34:32

"I would think that we would care about those other characters because other people care about those characters." What other people?

Breton 2009-06-11 04:36:04

Something I've noticed is that people around the world don't seem to realize the true extent of the ignorance most Americans have about people other than Americans.

Breton 2009-06-11 04:37:50

Ah, definitely an attitude problem, you nicely prove my point.

mghie 2009-06-11 04:39:39

Good point. Out of sight, out of mind. I'd add that our testers most often have those same keyboards and don't file Unicode-related bug reports unless the company explicitly targets customers who regularly use other languages.

JohnFx 2009-07-20 18:03:46

+10 A:

The widespread availability of development tools for working with Unicode may not be as recent an event as you suppose. Working with Unicode was, until just a few years ago, a painful task of converting between character formats and dealing with incomplete or buggy implementations. You say it's not that hard, and as the tools improve that is becoming more true. But there are a lot of ways to trip up unless the details are hidden from you by good languages and libraries. Hell, just cutting and pasting unicode characters could be a questionable proposition a few years back. Developer education also took some time, and you still see people make a ton of really basic mistakes.

The Unicode standard weighs probably ten pounds. Even just an overview of it would have to discuss the subtle distinctions between characters, glyphs, codepoints, etc. Now think about ASCII. It's 128 characters. I can explain the entire thing to someone that knows binary in about 5 minutes.

I believe that almost all software should be written with full Unicode support these days, but it's been a long road to achieving a truly international character set with encoding to suit a variety of purposes, and it's not over just yet.

PeterAllenWebb 2009-06-11 04:12:31

Great answer. I don't really know what else to say.

docgnome 2009-06-11 04:27:36

I still find it difficult to put Unicode characters into my C++ source files.

Mark Ransom 2009-06-11 05:21:54

+5 A:

One huge factor is programming language support, most of which use a character set that fits in 8 bits (like ASCII) as the default for strings. Java's String class uses UTF-16, and there are others that support variants of Unicode, but many languages opt for simplicity. Space is so trivial of a concern these days that coders who cling to "space efficient" strings should be slapped. Most people simply aren't running on embedded devices, and even devices like cell phones (the big computing wave of the near future) can easily handle 16-bit character sets.

Another factor is that many programs are written only to run in English, and the developers (1) don't plan (or even know how) to localize their code for multiple languages, and (2) they often don't even think about handling input in non-Roman languages. English is the dominant natural language spoken by programmers (at least, to communicate with each other) and to a large extent, that has carried over to the software we produce. However, the apathy and/or ignorance certainly can't last forever... Given the fact that the mobile market in Asia completely dwarfs most of the rest of the world, programmers are going to have to deal with Unicode quite soon, whether they like it or not.

For what it's worth, I don't think the complexity of the Unicode standard is not that big of a contributing factor for programmers, but rather for those who must implement language support. When programming in a language where the hard work has already been done, there is even less reason to not use the tools at hand. C'est la vie, old habits die hard.

Quinn Taylor 2009-06-11 04:38:45

Great answer. I think you have a stray "not" however.

docgnome 2009-06-11 04:53:23

"Space is so trivial of a concern these days that coders who cling to 'space efficient' strings should be slapped" I agree completely.

docgnome 2009-06-11 05:00:10

Heh, gotta love the random downvoter that leaves no explanation. :-/

Quinn Taylor 2009-07-20 21:02:55

+4 A:

All operating systems until very recently were built on the assumption that a character was a byte. Their APIs were built like that, the tools were built like that, the languages were built like that.

Yes, it would be much better if everything I wrote was already... err... UTF-8? UTF-16? UTF-7? UTF-32? Err... mmm... It seems that whatever you pick, you'll annoy someone. And, in fact, that's the truth.

If you pick UTF-16, then all of your data, as in pretty much the Western world's whole economy, stops being seamlessly readable, as you lose ASCII compatibility. Add to that, a byte ceases to be a character, which seriously breaks the assumptions on which today's software is built. Furthermore, some countries do not accept UTF-16. Now, if you pick ANY variable-length encoding, you break some basic premises of lots of software, such as not needing to traverse a string to find the nth character, or being able to read a string starting from any point in it.
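
To make the "a byte ceases to be a character" point concrete, here is a small Python sketch. The sample word and the helper decodes_cleanly are assumptions made for the illustration.

```python
text = "na\u00efve"                 # "naïve": 5 code points
utf8 = text.encode("utf-8")         # ï takes 2 bytes in UTF-8
utf16 = text.encode("utf-16-le")    # each of these characters takes 2 bytes

print(len(text), len(utf8), len(utf16))

# Indexing the bytes by position no longer lands on "the nth character".
third_char = text[2]        # the code point ï
third_byte = utf8[2:3]      # only the first of the two bytes that encode ï


def decodes_cleanly(data):
    """Return True if the bytes form complete, valid UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Reading from an arbitrary byte offset can start in the middle of a character.
print(decodes_cleanly(utf8[2:]))   # starts at the lead byte of ï
print(decodes_cleanly(utf8[3:]))   # starts at its continuation byte
```

Code that slices or seeks by byte offset keeps working on ASCII input and only fails once a multi-byte character shows up, which is part of why the assumption is so hard to flush out.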

And then UTF-32... well, that's four bytes per character. What was the average hard drive size or memory size just 10 years ago? UTF-32 was too big!

So, the only solution is to change everything -- software, utilities, operating systems, languages, tools -- at the same time to be i18n-aware. Well. Good luck with "at the same time".

And if we can't do everything at the same time, then we always have to keep an eye out for stuff which hasn't been internationalized. Which causes a vicious cycle.

It's easier for end user applications than for middleware or basic software, and some new languages are being built that way. But... we still use Fortran libraries written in the 60s. That legacy, it isn't going away.

Daniel 2009-06-11 04:41:35

Well, it depends on what you consider very recently. To quote the Wikipedia entry for Windows NT: "Windows NT was one of the earliest operating systems to use Unicode internally." Its Ansi versions of API functions all go through translation to the internal character encoding first. Version 3.1 was released in 1993, and with Windows 2000 / XP at the latest the NT line entered the mainstream.

mghie 2009-06-11 07:32:05

I don't mean having APIs supporting Unicode, read again. As for "recently", anything in the last 15 years.

Daniel 2009-06-11 19:43:49

Maybe *you* should read again. The whole NT line of OSs is built around Unicode, and the Ansi API is just a layer on top of the real thing. Your answer starts with an erroneous statement, and it doesn't get much better after that.

mghie 2009-06-15 13:21:47

Well, answer one thing. How many characters are there in a FAT filename in compatibility mode?

Daniel 2009-06-15 17:26:59

+4 A:

Because UTF-16 became popular before UTF-8 and UTF-16 is a pig to work with. IMHO

Peter Ericson 2009-06-11 04:47:11

I personally do not like how certain Unicode encodings break things so that you can no longer do string[3] to get the 3rd character. Sure, it could be abstracted away, but imagine how much slower a big project with lots of strings, such as GCC, would be if it had to traverse a string to find the nth character. The only option is caching where the "useful" positions are, and even then it's slow, and in some formats you're now taking a good 5 bytes per character. To me, that is just ridiculous.

Earlz 2009-06-11 04:51:00

So your argument is to just forget internationalization because it would make string access slow? Now _that_ is ridiculous.

docgnome 2009-06-11 04:57:10

I'm not saying forget it. I'm saying don't use it where it doesn't need to be. You can argue it needs to be everywhere, but if you never plan on translating your strings to other languages, Unicode would really have no point. And I'm talking about a big project like GCC, if it treated C files as Unicode text files.

Earlz 2009-06-11 14:54:47

+36 A:

Start with a few questions

How often...

do you need to write an application that deals with something other than ascii?
do you need to write a multi-language application?
do you write an application that has to be multi-language from its first version?
have you heard that Unicode is used to represent non-ascii characters?
have you read that Unicode is a charset? That Unicode is an encoding?
do you see people confusing UTF-8 encoded bytestrings and Unicode data?

Do you know the difference between a collation and an encoding?

Where did you first hear of Unicode?

At school? (really?)
at work?
on a trendy blog?

Have you ever, in your young days, experienced moving source files from a system in locale A to a system in locale B, edited a typo on system B, saved the files, b0rking all the non-ascii comments and... ending up wasting a lot of time trying to understand what happened? (did your editor mix things up? the compiler? the system? the... ?)

Did you end up deciding that you will never again comment your code using non-ascii characters?
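
For readers who have not lived through this, the failure can be reproduced in a few lines of Python. This is a sketch under assumptions: the comment text is invented, and UTF-8 and Latin-1 stand in for whatever encodings "locale A" and "locale B" actually used.

```python
original = "# vérifié à la main"   # hypothetical non-ascii source comment

# System A saves the file as UTF-8.
on_disk = original.encode("utf-8")

# System B assumes Latin-1 when opening it: no error, just wrong characters.
misread = on_disk.decode("latin-1")
print(misread)

# Fixing a typo and saving again (as UTF-8 here) bakes the damage into the file.
resaved = misread.encode("utf-8")
print(resaved == on_disk)

# The text is recoverable only if you know exactly which wrong step was taken.
recovered = resaved.decode("utf-8").encode("latin-1").decode("utf-8")
print(recovered == original)
```

Nothing in this sequence raises an error, which is why the time goes into working out which tool did the damage.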

Have a look at what's being done elsewhere

Python

Did I mention on SO that I love Python? No? Well I love Python.

But until Python 3.0, its Unicode support sucked. And there were all those rookie programmers, who at that time barely knew how to write a loop, getting UnicodeDecodeError and UnicodeEncodeError out of nowhere when trying to deal with non-ascii characters. Well, they basically got traumatized for life by the Unicode monster, and I know a lot of very efficient/experienced Python coders who are still frightened today by the idea of having to deal with Unicode data.

And with Python 3, there is a clear separation between Unicode & bytestrings, but... look at how much trouble it is to port an application from Python 2.x to Python 3.x if you previously did not care much about the separation or if you don't really understand what Unicode is.
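
The separation looks roughly like this in Python 3. This is a minimal sketch; the sample data and variable names are made up for the illustration.

```python
data = "Jyväskylä".encode("utf-8")   # bytes, e.g. as read from a file or socket

mix_error = None
decode_error = None

# Bytes and text are distinct types; mixing them raises TypeError in Python 3.
try:
    "City: " + data
except TypeError as exc:
    mix_error = type(exc).__name__

# Decoding with the wrong codec is where UnicodeDecodeError comes from.
try:
    data.decode("ascii")
except UnicodeDecodeError as exc:
    decode_error = type(exc).__name__

# Decode once at the boundary, with the right codec, and work with str after that.
text = data.decode("utf-8")
print(mix_error, decode_error, text)
```

Python 2 would often convert between the two implicitly using ASCII, which is why the errors seemed to come from nowhere; Python 3 forces the decode step to be explicit.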

Databases, PHP

Do you know a popular commercial website that stores its international text as Unicode?

You will (perhaps) be surprised to learn that the Wikipedia backend does not store its data using Unicode. All text is encoded in UTF-8 and is stored as binary data in the database.

One key issue here is how to sort text data if you store it as Unicode codepoints. This is where Unicode collations come in: they define a sorting order on Unicode codepoints. But proper support for collations in databases is missing or in active development. (There are probably a lot of performance issues, too. -- IANADBA) Also, there is no widely accepted standard for collations yet: for some languages, people don't agree on how words/letters/wordgroups should be sorted.
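
A short Python sketch of why raw codepoint order is not a usable sort order. The word list is invented, and naive_key is a hypothetical, deliberately crude stand-in for a collation key: it is not the Unicode Collation Algorithm and does not respect any particular language's rules.

```python
import unicodedata

words = ["zebra", "Zulu", "éclair", "apple", "eclipse"]

# Plain sorted() compares code points: uppercase sorts first, é sorts after z.
by_code_point = sorted(words)


def naive_key(word):
    """Crude stand-in for a collation key: strip accents and ignore case."""
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()


by_naive_key = sorted(words, key=naive_key)

print(by_code_point)
print(by_naive_key)
```

Even this toy key bakes in decisions (accents ignored, case ignored) that are right for some languages and wrong for others, which is the disagreement described above.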

Have you heard of Unicode normalization? (Basically, you should convert your Unicode data to a canonical representation before storing it.) Of course it's critical for database storage, or local comparisons. But PHP, for example, has only provided support for normalization since 5.2.4, which came out in August 2007.

And in fact, PHP does not completely support Unicode yet. We'll have to wait for PHP6 to get Unicode-compatible functions everywhere.
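
As an illustration of what normalization fixes, here is a minimal sketch using Python's standard unicodedata module (Python rather than PHP, purely for brevity; the variable names are arbitrary).

```python
import unicodedata

composed = "\u00e9"      # é as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# Both render the same way, but they are different code point sequences.
print(composed == decomposed)

# Normalizing both to the same form (NFC here) makes them comparable.
nfc_a = unicodedata.normalize("NFC", composed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(nfc_a == nfc_b)

# NFD goes the other way and splits é into base letter plus combining mark.
nfd_length = len(unicodedata.normalize("NFD", composed))
print(nfd_length)
```

Without this step, a lookup for a name typed on one system can miss the same name stored from another, even though both look identical on screen.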

So, why isn't everything we do in Unicode?

Some people don't need Unicode.
Some people don't care.
Some people don't understand that they will need Unicode support later.
Some people don't understand Unicode.
For some others, Unicode is a bit like accessibility for webapps: you start without, and will add support for it later
A lot of popular libraries/languages/applications lack proper, complete Unicode support, not to mention collation & normalization issues. And until all items in your development stack completely support Unicode, you can't write a clean Unicode application.

The Internet clearly helps spread the Unicode trend. And it's a good thing. Initiatives like Python3's breaking changes help educate people about the issue. But we will have to wait patiently a bit more to see Unicode everywhere and new programmers instinctively using Unicode instead of Strings where it matters.

As an anecdote: because FedEx apparently does not support international addresses, the Google Summer of Code '09 students were all asked by Google to provide an ascii-only name and address for shipping. If you think that most business actors understand the stakes behind Unicode support, you are just wrong. FedEx does not understand, and their clients do not really care. Yet.

NicDumZ 2009-06-11 06:46:20

I think your anecdote shows that there isn't really that first category (Some people don't need Unicode.) - they all fall into the second and third category.

mghie 2009-06-11 07:40:08

1) Like mghie said, these people are just plain wrong. 2) Same, imho, goes for number two. 3) Agreed! We need to bludgeon this into their heads! You can't call YAGNI on Unicode. 4) Sure. But they can learn, so I don't think this should be a real impediment. And to some extent needing to understand it can be abstracted away. Or at least needing to master it can be abstracted away. 5) Same as one. 6) Ah ha! Now that is a good reason. Sounds like we need to get cracking! Great answer. (Though what do my /young/ days have to do with anything? Or am I just misreading that line?)

docgnome 2009-06-11 14:25:13

mmm the /young/ days anecdote was about the frightening and time-consuming experience of... character sets. For some reason people stop at that first experience: you know... "OSs are not even able to deal with different locales, _wow_ it must be /so/ complicated to write Unicode-compatible applications". True, it's quite unrelated. But sometimes I hear that false reason too.

NicDumZ 2009-06-11 14:34:47

@NicDumz Ah. Gotcha. Thanks for being so responsive. Seriously, great answer. I was wondering which of the other answers I should accept, as several are good. Your answer really brought it home though.

docgnome 2009-06-11 14:40:43

"FedEx does not understand, and their clients do not really care. Yet." This is actually kind of interesting. What kind of characters should be allowed to appear on an international shipping label? Should a FedEx facility located in California be expected to deliver a package with an address printed in Arabic script or Chinese?

PeterAllenWebb 2009-06-11 15:02:56

I would add: how many of the programmers that use Unicode are aware of surrogates in UTF-16 (not Joel), or know the difference between a character and a code point...
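
Both distinctions can be shown in a few lines of Python. This is a sketch; the sample characters and the helper utf16_code_units are chosen for the illustration.

```python
def utf16_code_units(s):
    """Number of 16-bit code units needed to store s in UTF-16."""
    return len(s.encode("utf-16-le")) // 2


emoji = "\U0001F600"     # one code point outside the Basic Multilingual Plane
accented = "e\u0301"     # one user-perceived character, two code points

# One code point, but two UTF-16 code units: a surrogate pair.
print(len(emoji), utf16_code_units(emoji))
print(emoji.encode("utf-16-be").hex())   # high surrogate, then low surrogate

# One character on screen, but two code points.
print(len(accented))
```

Code that treats a UTF-16 code unit as a character will cut the emoji in half; code that treats a code point as a character will separate the accent from its letter.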

Artyom 2009-06-14 05:26:58

Awesome post. Bravo!

JohnFx 2009-07-20 18:00:07

@PeterAllenWebb: I'd say you are allowed to laugh at FedEx if they write Kln on your package because you live in Köln and they only support ASCII. Same for Nîmes, Tromsø, Jyväskylä, Varaždin.
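
What an ASCII-only pipeline does to these names can be sketched in Python. The two helper functions are hypothetical illustrations of common shortcuts, not a description of what FedEx actually does.

```python
import unicodedata

cities = ["Köln", "Nîmes", "Tromsø", "Jyväskylä", "Varaždin"]


def drop_non_ascii(name):
    """Throw away every character that is not ASCII."""
    return name.encode("ascii", errors="ignore").decode("ascii")


def strip_accents(name):
    """Decompose accented letters, then drop whatever is still not ASCII."""
    decomposed = unicodedata.normalize("NFKD", name)
    return decomposed.encode("ascii", errors="ignore").decode("ascii")


for city in cities:
    print(city, "->", drop_non_ascii(city), "/", strip_accents(city))
```

Stripping accents looks better than dropping characters, but it still loses the ø in Tromsø, which has no decomposition into a base letter plus accent, and a name minus its accents may not be the name anyone uses.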

Ölbaum 2009-09-10 21:47:07

+1 A:

Tradition and attitude. ASCII and computers are sadly synonyms to many people.

However, it would be naïve to think that the rôle of Unicode is only a matter of exotic languages from Eurasia and other parts of the world. A rich text encoding has lots of meaning to bring even to a "plain" English text. Look in a book sometime.

kaizer.se 2009-07-20 17:56:35

+3 A:

Because for 99% of applications, Unicode support is not a checkbox on the customer's product comparison matrix.

Add to the equation:

It takes a conscious effort with almost no readily visible benefit.
Many programmers are afraid of it or don't understand it.
Management REALLY doesn't understand it or care about it, at least not until a customer is screaming about it.
The testing team isn't testing for Unicode compliance.
"We didn't localize the UI, so non-English speakers wouldn't be using it anyway."

JohnFx 2009-07-20 18:09:20

+1 A:

I would say there are mainly two reasons. The first one is simply that the Unicode support of your tools just isn't up to snuff. C++ still doesn't have Unicode support and won't get it until the next standard revision, which will take maybe a year or two to be finished and then another five or ten years to be in widespread use. Many other languages aren't much better, and even if you finally have Unicode support, it might still be more cumbersome to use than plain ASCII strings.

The second reason is in part what is causing the first issue: Unicode is hard. It's not rocket science, but it gives you a ton of problems that you never had to deal with in ASCII. With ASCII you had a clear one byte == one glyph relationship, could address the Nth character of a string with a simple str[N], could just store all characters of the whole set in memory, and so on. With Unicode you can no longer do that: you have to deal with the different ways Unicode is encoded (UTF-8, UTF-16, ...), byte order marks, decoding errors, lots of fonts that have only a subset of the characters you would need for full Unicode support, more glyphs than you want to store in memory at a given time, and so on.
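
A few of these extra concerns (byte order marks, byte order, decoding errors) can be seen directly in Python. This is a sketch with invented sample data, not a complete treatment.

```python
import codecs

# A UTF-8 file that starts with a byte order mark (BOM).
with_bom = codecs.BOM_UTF8 + "hello".encode("utf-8")
plain = with_bom.decode("utf-8")          # the BOM survives as U+FEFF
stripped = with_bom.decode("utf-8-sig")   # the BOM is removed

# The same text has two different byte layouts in UTF-16.
big_endian = "hi".encode("utf-16-be")
little_endian = "hi".encode("utf-16-le")

# Latin-1 bytes are not valid UTF-8; a strict decode raises an error,
# while errors="replace" substitutes U+FFFD for the bad byte.
broken = b"caf\xe9"
try:
    broken.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
lenient = broken.decode("utf-8", errors="replace")

print(repr(plain), repr(stripped))
print(big_endian, little_endian)
print(strict_failed, repr(lenient))
```

None of these cases can arise with pure ASCII input, which is the asymmetry described above.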

ASCII could be understood by just looking at an ASCII table without any further documentation; with Unicode that is simply no longer the case.

Grumbel 2009-09-10 21:26:41

Someone else mentioned the nth character issue. Surely this can be abstracted away. Iirc, python manages this somehow so it's clearly doable. Surely the problems you describe can be abstracted away for the most part.

docgnome 2009-09-11 23:10:31

It can be abstracted away, but when you do it the easy way you end up with O(N) instead of O(1), and when you do it the hard way you might end up having to deal with both UTF-8 and UTF-32 in your code. And of course in languages like C, which doesn't have operator overloading, things get a little ugly. None of these problems are very hard to solve, but they require a bit of head scratching in cases where it was completely trivial with plain ASCII, and of course when you have lots of code dealing with raw 'char*' you are in for some rewriting.
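
A sketch of what the "easy way" might look like over raw UTF-8 bytes, written in Python for brevity. The function name nth_code_point is hypothetical, and the code assumes its input is valid UTF-8.

```python
def nth_code_point(utf8_bytes, n):
    """Return the n-th code point (0-based) by scanning from the start: O(N)."""
    if n < 0:
        raise IndexError("negative indexes are not supported in this sketch")
    index = -1
    start = None
    for pos, byte in enumerate(utf8_bytes):
        # Continuation bytes look like 0b10xxxxxx; any other byte starts a code point.
        if byte & 0xC0 != 0x80:
            index += 1
            if index == n:
                start = pos
            elif index == n + 1:
                return utf8_bytes[start:pos].decode("utf-8")
    if start is None:
        raise IndexError("code point index out of range")
    # The requested code point was the last one in the data.
    return utf8_bytes[start:].decode("utf-8")


data = "héllo 😀".encode("utf-8")
print(nth_code_point(data, 1), nth_code_point(data, 6))
```

The "hard way" would be to decode once into a fixed-width form (for example UTF-32) and index that in O(1), at the cost of more memory and of carrying two representations around.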

Grumbel 2009-09-12 04:29:02
