views:

272

answers:

5

What's the basis for Unicode and why the need for UTF-8 or UTF-16? I have researched this on Google and searched here as well but it's not clear to me.

In VSS, when doing a file comparison, sometimes there is a message saying the two files have differing UTFs. Why would this be the case?

Please explain in simple terms.

+11  A: 

Sounds like you need to read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets! It's a very good explanation of what's going on.

Brian Agnew
+1 for the ultimate source :)
John Weldon
Nice! Thanks Brian.
SoftwareGeek
+1  A: 

Unicode is a character set that accommodates millions of different code points.

UTF-8 is an encoding scheme: a way to store Unicode code points in fewer than 4 bytes each where possible.
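As a quick illustration of the "fewer bytes where possible" point, here is a small Python sketch showing how many bytes UTF-8 uses for code points of increasing value (the sample characters are just examples):

```python
# UTF-8 is variable-length: each code point takes 1-4 bytes
# depending on its value.
for ch in ["A", "é", "€", "𝄞"]:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex()}")
```

ASCII characters stay at 1 byte, while the G clef (U+1D11E, outside the Basic Multilingual Plane) needs the full 4 bytes.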

John Weldon
Unicode is *not* an encoding scheme, at least not in the conventional sense of the word. That's *why* encoding schemes such as UTF-8, UTF-16, UTF-32, UTF-7 and UTF-7,5 are needed in *addition* to Unicode. Unicode is a character set. It also defines a mapping between characters and codepoints, but those codepoints are abstract entities, not concrete representations, which is why I wouldn't call that an encoding but rather a mapping.
Jörg W Mittag
Thanks @Jörg W Mittag
John Weldon
A: 

There's a shorter introduction on my blog. It draws on Joel's post, but is applied to a specific issue.

Stefano Borini
A: 

This FAQ from the official Unicode web site has some answers for you.

Nemanja Trifunovic
A: 

Originally, Unicode was intended to have a fixed-width 16-bit encoding (UCS-2). Early adopters of Unicode, like Java and Windows NT, built their libraries around 16-bit strings.

Later, the scope of Unicode was expanded to include historical characters, which would require more than the 65,536 code points a 16-bit encoding would support. To allow the additional characters to be represented on platforms that had used UCS-2, the UTF-16 encoding was introduced. It uses "surrogate pairs" to represent characters in the supplementary planes.
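To make the surrogate-pair mechanism concrete, here is a small Python sketch (the G clef character is just an example of a supplementary-plane code point):

```python
# U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP, so UTF-16
# encodes it as a surrogate pair: a high surrogate (D800-DBFF)
# followed by a low surrogate (DC00-DFFF).
clef = "\U0001D11E"
utf16 = clef.encode("utf-16-be")      # big-endian, no BOM
high = int.from_bytes(utf16[:2], "big")
low = int.from_bytes(utf16[2:], "big")
print(f"{high:04X} {low:04X}")        # prints: D834 DD1E
```

One code point thus occupies two 16-bit code units, which is how platforms built around UCS-2 gained access to the supplementary planes.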

Meanwhile, a lot of older software and network protocols were using 8-bit strings. UTF-8 was made so these systems could support Unicode without having to use wide characters. It's backwards-compatible with 7-bit ASCII.
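The ASCII compatibility mentioned above can be checked directly; a minimal sketch:

```python
# A pure-ASCII string produces identical bytes under ASCII and UTF-8,
# which is why legacy 8-bit systems can pass UTF-8 through unchanged.
text = "Hello, world!"
assert text.encode("ascii") == text.encode("utf-8")
print(text.encode("utf-8"))
```

Every 7-bit ASCII byte is a valid one-byte UTF-8 sequence for the same character, so existing ASCII data is already valid UTF-8.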

dan04