views:

272

answers:

5

What's the basis for Unicode and why the need for UTF-8 or UTF-16? I have researched this on Google and searched here as well but it's not clear to me.

In VSS, when doing a file comparison, sometimes there is a message saying the two files have differing UTFs. Why would this be the case?

Please explain in simple terms.

+11  A: 

Sounds like you need to read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets! It's a very good explanation of what's going on.

Brian Agnew
+1 for the ultimate source :)
John Weldon
Nice! Thanks Brian.
SoftwareGeek
+1  A: 

Unicode is a character set that accommodates millions of different code points.

UTF-8 is an encoding scheme: a way to store Unicode code points in fewer than 4 bytes each where possible.
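As a quick illustration of the "fewer bytes where possible" point, here is a small Python sketch showing how many bytes UTF-8 uses for code points of increasing value (the sample characters are just examples):

```python
# UTF-8 is variable-length: each code point takes 1-4 bytes
# depending on its value.
for ch in ["A", "é", "€", "𝄞"]:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex()}")
```

ASCII characters stay at 1 byte, while the G clef (U+1D11E, outside the Basic Multilingual Plane) needs the full 4 bytes.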

John Weldon
Unicode is *not* an encoding scheme, at least not in the conventional sense of the word. That's *why* encoding schemes such as UTF-8, UTF-16, UTF-32, UTF-7 and UTF-7,5 are needed in *addition* to Unicode. Unicode is a character set. It also defines a mapping between characters and codepoints, but those codepoints are abstract entities, not concrete representations, which is why I wouldn't call that an encoding but rather a mapping.
Jörg W Mittag
Thanks @Jörg W Mittag
John Weldon
A: 

There's a shorter introduction on my blog. It draws on Joel's post, but is applied to a specific issue.

Stefano Borini
A: 

This FAQ from the official Unicode web site has some answers for you.

Nemanja Trifunovic
A: 

Originally, Unicode was intended to have a fixed-width 16-bit encoding (UCS-2). Early adopters of Unicode, like Java and Windows NT, built their libraries around 16-bit strings.

Later, the scope of Unicode was expanded to include historical characters, which would require more than the 65,536 code points a 16-bit encoding would support. To allow the additional characters to be represented on platforms that had used UCS-2, the UTF-16 encoding was introduced. It uses "surrogate pairs" to represent characters in the supplementary planes.
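To make the surrogate-pair mechanism concrete, here is a small Python sketch (the G clef character is just an example of a supplementary-plane code point):

```python
# U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP, so UTF-16
# encodes it as a surrogate pair: a high surrogate (D800-DBFF)
# followed by a low surrogate (DC00-DFFF).
clef = "\U0001D11E"
utf16 = clef.encode("utf-16-be")      # big-endian, no BOM
high = int.from_bytes(utf16[:2], "big")
low = int.from_bytes(utf16[2:], "big")
print(f"{high:04X} {low:04X}")        # prints: D834 DD1E
```

One code point thus occupies two 16-bit code units, which is how platforms built around UCS-2 gained access to the supplementary planes.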

Meanwhile, a lot of older software and network protocols were using 8-bit strings. UTF-8 was made so these systems could support Unicode without having to use wide characters. It's backwards-compatible with 7-bit ASCII.
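The ASCII compatibility mentioned above can be checked directly; a minimal sketch:

```python
# A pure-ASCII string produces identical bytes under ASCII and UTF-8,
# which is why legacy 8-bit systems can pass UTF-8 through unchanged.
text = "Hello, world!"
assert text.encode("ascii") == text.encode("utf-8")
print(text.encode("utf-8"))
```

Every 7-bit ASCII byte is a valid one-byte UTF-8 sequence for the same character, so existing ASCII data is already valid UTF-8.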

dan04