views:

1533

answers:

11

Being an application developer, do I need to know Unicode?

+4  A: 

This article from Joel Spolsky should help you a lot.

jdecuyper
+51  A: 

Read Joel's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Galwegian
Exactly what I was thinking when I read the question :)
VVS
+4  A: 

Unicode is an industry-agreed standard for consistently representing text, with the capacity to represent all of the world's writing systems. All developers need to know about it, as globalization is a growing concern.

jmcd
+9  A: 

At the risk of just adding another link, unicode.org is a spectacular resource.

In short, it's a replacement for ASCII that's designed to handle, literally, every character ever used by humans. Unicode has several encoding schemes to handle all those characters - UTF-8, which is more or less the standard these days, uses just a single byte for the most common characters and is byte-for-byte identical to ASCII for the first 128 code points.
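As a quick sanity check of that ASCII-compatibility claim (a minimal Python 3 sketch of my own, not part of the original answer):

    # The first 128 code points encode to the same single byte in UTF-8 as in ASCII.
    for i in range(128):
        assert chr(i).encode("utf-8") == bytes([i])

    # Beyond ASCII, UTF-8 needs more bytes per character.
    print(len("é".encode("utf-8")))   # 2 bytes
    print(len("漢".encode("utf-8")))  # 3 bytes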

(As an addendum, there's a popular misconception amongst programmers that you only need to know about Unicode if you're going to be doing internationalization. While that's certainly one use, it's not the only one. For example, I'm working on a project that will only ever use English text - but with a huge number of fancy math symbols. Moving the whole project over to be fully Unicode solved more problems than I can count.)

Electrons_Ahoy
+25  A: 
erickson
Good set of links - thanks.
Jonathan Leffler
Yeah, that's excellent.
Electrons_Ahoy
+1  A: 

Unicode is a standard that enumerates characters, and gives them unique numeric IDs (called "code points"). It includes a very large, and growing, set of characters for most modern written languages, and also a lot of exotic things like ancient Greek musical notation.

Unlike other character encoding schemes (like ASCII or the ISO-8859 standards), Unicode does not say anything about representing these characters in bytes; it just gives a universal set of IDs to characters. So it is wrong to say that Unicode is "a 16-bit replacement for ASCII".

There are various encoding schemes that can represent arbitrary Unicode characters in bytes, including UTF-8, UTF-16, and others.
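To make the distinction concrete, here is a minimal Python 3 sketch (my own example, not from the original answer): the code point is one abstract number, while the bytes depend entirely on which encoding you pick.

    ch = "€"                           # EURO SIGN, code point U+20AC
    print(hex(ord(ch)))                # 0x20ac -> the abstract code point
    print(len(ch.encode("utf-8")))     # 3 bytes
    print(len(ch.encode("utf-16-be"))) # 2 bytes
    print(len(ch.encode("utf-32-be"))) # 4 bytes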

eli
+2  A: 

Unicode is a character set. Unlike ASCII (which contains only 128 characters, enough for US English, 33 of them being non-printable control characters), it has room for over a million characters (code points), including characters of every language known (Chinese, Russian, Greek, Arabic, etc.) and some languages you have probably never even heard of (even lots of dead-language symbols no longer in use, but useful for archiving ancient documents).

So instead of dealing with dozens of different character encodings, you have one character set for all of them (which also makes it easier to mix characters from different languages within a single text string, as you don't need to switch encodings somewhere in the middle of the string). Actually there is still plenty of room left: we are far from having all of that code space in use, and the Unicode Consortium could easily add symbols for another 100 languages without even starting to fear running out of space.

Pretty much any book in any language you can find in a library today can be expressed in Unicode. Unicode only assigns the numbers, though; how they are expressed as bytes is a separate issue. There are several ways to encode Unicode characters, such as UTF-8 (one to four bytes per character, depending on the code point - US English text is almost always one byte per character, other Roman-alphabet languages might need two or three, Chinese/Japanese need more), UTF-16 (most characters take two bytes, some rarely used ones take four), and UTF-32 (every character takes four bytes). There are others, but these are the dominant ones.
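As a rough illustration of those byte counts (a Python 3 sketch of my own; the example strings are arbitrary):

    # Bytes needed per string in each of the three dominant encodings.
    for s in ("hello", "naïve", "漢字", "𝄞"):   # 𝄞 is U+1D11E, MUSICAL SYMBOL G CLEF
        print(s, len(s.encode("utf-8")),
                 len(s.encode("utf-16-be")),
                 len(s.encode("utf-32-be")))
    # "hello" -> 5 / 10 / 20, "naïve" -> 6 / 10 / 20,
    # "漢字"  -> 6 / 4 / 8,   "𝄞"    -> 4 / 4 / 4 (a surrogate pair in UTF-16)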

Unicode is the default text representation in many newer OSes (in Mac OS X almost everything is Unicode) and programming languages (Java uses Unicode strings by default, stored as UTF-16; I heard Python does as well, and will use or already does use UTF-32 internally). If you ever plan to write an app that should display, store, or process anything other than plain English text, you'd better get used to Unicode - the sooner the better.

Mecki
+2  A: 

One (open) source of code for handling Unicode is ICU - International Components for Unicode. It includes ICU4J for Java and ICU4C for C and C++ (presents a C interface; uses a C++ compiler).
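As a small taste of what ICU offers beyond most standard libraries - locale-aware collation, for example - here is a hedged sketch using the PyICU bindings (my own choice for illustration; the answer itself points at ICU4J and ICU4C, which expose the same functionality):

    # Assumes the PyICU package is installed (pip install PyICU).
    from icu import Collator, Locale

    words = ["zebra", "Äpfel", "apple", "Übung"]

    # Plain sorting compares raw code points, which pushes the umlauted
    # words after "zebra".
    print(sorted(words))

    # An ICU collator sorts according to the rules of a locale (German here),
    # keeping Ä and Ü near their base letters.
    collator = Collator.createInstance(Locale("de_DE"))
    print(sorted(words, key=collator.getSortKey))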

Jonathan Leffler
A: 

You don't need to learn the whole Unicode standard to use it; it's a hugely complex specification. You just need to know the main issues and how your programming tools deal with it. To learn that, check Galwegian's link and your programming language and IDE documentation.

E.g.:

You can convert any character from Latin-1 to Unicode, but it doesn't work the other way for all characters. PHP lets you know that some functions (like stristr) do not work with Unicode. Python declares Unicode strings this way: u"Hello World".
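For instance (a minimal Python 3 sketch of my own, using modern syntax rather than the u"..." literals mentioned above):

    # Every Latin-1 character maps cleanly into Unicode and back...
    text = "naïve"
    assert text.encode("latin-1").decode("latin-1") == text

    # ...but not every Unicode character fits into Latin-1.
    try:
        "☃".encode("latin-1")          # U+2603 SNOWMAN has no Latin-1 byte
    except UnicodeEncodeError as err:
        print(err)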

That's the kind of thing you must know.

Knowing that, if you do not have a GOOD reason not to use Unicode, then just use it.

e-satis
A: 

Here you can find a great guide:

http://www.joelonsoftware.com/articles/Unicode.html

Cesar