views: 135

answers: 7

[screenshot: the editor's "Save As" encoding menu]

Is it true that Unicode = UTF-16?

UPDATE

Many are saying Unicode is a standard, not an encoding, but most editors support save as ‘Unicode’ encoding actually.

+1  A: 

The development of Unicode was aimed at creating a new standard for mapping the characters of the great majority of languages in use today, along with other characters that are less essential but may still be needed in text. UTF-8 is only one of many ways to encode a file, because there is more than one way to encode a file's Unicode characters as bytes.

Source:

http://www.differencebetween.net/technology/difference-between-unicode-and-utf-8/
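
A quick, illustrative Python sketch (the sample string is arbitrary): the same sequence of Unicode code points can be written out as several different byte sequences, and each of those encodings is equally "Unicode".

```python
# One Unicode string, three different byte representations of it.
text = "héllo"                    # five code points: U+0068 U+00E9 U+006C U+006C U+006F

print(text.encode("utf-8"))       # b'h\xc3\xa9llo'                  (6 bytes)
print(text.encode("utf-16-le"))   # b'h\x00\xe9\x00l\x00l\x00o\x00'  (10 bytes)
print(text.encode("utf-32-le"))   # 20 bytes, four per code point
```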

Trufa
+7  A: 

It's not that simple.

UTF-16 is a variable-width encoding that uses 16-bit code units. Simply calling something "Unicode" is ambiguous, since "Unicode" refers to an entire set of standards for character encoding. Unicode is not an encoding!

http://en.wikipedia.org/wiki/Unicode#Unicode_Transformation_Format_and_Universal_Character_Set

and of course, the obligatory Joel On Software - The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) link.

Matt Ball
+1  A: 

It's weird: Unicode is a standard, not an encoding. Since it's possible to specify the endianness, I guess that menu entry is effectively UTF-16, or maybe UTF-32.

Where does this menu come from?

MatTheCat
From a text editor called EditPlus.
ollydbg
+1  A: 

In addition to Trufa's comment, Unicode explicitly isn't UTF-16. When Unicode was first being designed, it was thought that a 16-bit integer might be enough to store any code point, but in practice that turned out not to be the case. However, UTF-16 is another valid encoding of Unicode - alongside the 8-bit and 32-bit variants - and I believe it is the encoding that Microsoft uses in memory at runtime on the NT-derived operating systems.
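
A small Python sketch of that point (the particular character is just an example): anything above U+FFFF needs two 16-bit code units, a surrogate pair, in UTF-16, which is exactly why a single 16-bit integer turned out not to be enough.

```python
# U+1F600 lies outside the Basic Multilingual Plane, so it cannot fit in one 16-bit unit.
ch = "\U0001F600"
print(hex(ord(ch)))                  # 0x1f600 -> needs more than 16 bits
print(ch.encode("utf-16-be").hex())  # 'd83dde00' -> surrogate pair D83D DE00
print(len(ch.encode("utf-16-be")))   # 4 bytes, i.e. two 16-bit code units
```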

Tommy
So for Visual Studio, `Unicode = UTF-16` holds, right?
ollydbg
@ollydbg, it is true that UTF-16 is the natural representation of Unicode in Windows, but that does not make them identical.
Mark Ransom
+8  A: 

most editors support save as ‘Unicode’ encoding actually.

This is an unfortunate misnaming perpetrated by Windows.

Because Windows uses the UTF-16LE encoding internally as the memory storage format for Unicode strings, it considers this to be the natural encoding of Unicode text. In the Windows world, there are ANSI strings (in the system codepage on the current machine, and therefore totally unportable) and there are Unicode strings (stored internally as UTF-16LE).

This was all devised in the early days of Unicode, before we realised that UCS-2 wasn't enough, and before UTF-8 was invented. This is why Windows's support for UTF-8 is all-round poor.

This misguided naming scheme became part of the user interface. A text editor that uses Windows's encoding support to provide a range of encodings will automatically and inappropriately describe UTF-16LE as “Unicode”, and UTF-16BE, if provided, as “Unicode big-endian”.

(Other editors that do encodings themselves, like Notepad++, don't have this problem.)

If it makes you feel any better about it, ‘ANSI’ strings aren't based on any ANSI standard, either.
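
To make the naming concrete, here is a hedged Python sketch (the sample string is arbitrary): what such editors label "Unicode" is essentially UTF-16LE, usually written with a byte-order mark, and "Unicode big-endian" is UTF-16BE.

```python
text = "Aλ"                              # U+0041, U+03BB

print(text.encode("utf-16-le").hex())    # '4100bb03' -- what Windows calls "Unicode"
print(text.encode("utf-16-be").hex())    # '004103bb' -- "Unicode big-endian"
print(text.encode("utf-16").hex())       # BOM first, then native-order UTF-16
                                         # (on a little-endian machine: 'fffe4100bb03')
```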

bobince
+1+1+1+1+1+1+1+1
BalusC
To be fair, at the time Windows (NT branch) was released, the only Unicode encoding form was the one with 16-bit code units which is today called UTF-16.
Nemanja Trifunovic
Actually, the Unicode encoding that was used in NT4 was UCS-2, which is not the same as UTF-16. UCS-2 only supports the BMP (Unicode codepoints U+0000 - U+FFFF). UTF-16, on the other hand, supports all known Unicode codepoints (U+0000 - U+10FFFF) via surrogates. Windows was switched from UCS-2 to UTF-16 in Win2K.
Remy Lebeau - TeamB
+3  A: 

There's a lot of misunderstanding being displayed here. Unicode isn't an encoding, but the Unicode standard is devoted primarily to encoding anyway.

ISO 10646 is the international character set you (probably) care about. It defines a mapping between a set of named characters (e.g., "Latin Small Letter A" or "Greek Small Letter Alpha") and a set of code points (a number assigned to each -- for example, 61 hexadecimal and 3B1 hexadecimal for those two respectively; for Unicode code points, the standard notation would be U+0061 and U+03B1).
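
A short Python illustration of that mapping, using the two characters above; nothing here has been encoded to bytes yet, it is purely names and code points.

```python
import unicodedata

# Print each character's code point and its standard character name.
for ch in ("a", "α"):
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
# U+0061  LATIN SMALL LETTER A
# U+03B1  GREEK SMALL LETTER ALPHA
```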

At one time, Unicode defined its own character set, more or less as a competitor to ISO 10646. That was a 16-bit character set, but it was not UTF-16; it was known as UCS-2. It included a rather controversial technique to try to keep the number of necessary characters to a minimum (Han Unification -- basically treating Chinese, Japanese and Korean characters that were quite a bit alike as being the same character).

Since then, the Unicode consortium has tacitly admitted that that wasn't going to work, and now concentrate primarily on ways to encode the ISO 10646 character set. The primary methods are UTF-8, UTF-16 and UCS-4 (aka UTF-32). Those (except for UTF-8) also have LE (little endian) and BE (big-endian) variants.

By itself, "Unicode" could refer to almost any of the above (though we can probably eliminate the others that it shows explicitly, such as UTF-8). Unqualified use of "Unicode" probably happens the most often on Windows, where it will almost certainly refer to UTF-16. Early versions of Windows NT adopted Unicode when UCS-2 was current. After UCS-2 was declared obsolete (around Win2k, if memory serves), they switched to UTF-16, which is the most similar to UCS-2 (in fact, it's identical for characters in the "basic multilingual plane", which covers a lot, including all the characters for most Western European languages).
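
A hedged Python sketch of that last point (the characters are arbitrary examples): for a BMP character, UTF-16 is simply the 16-bit code point value, so UCS-2 and UTF-16 agree there; outside the BMP, UTF-16 uses surrogate pairs, which UCS-2 cannot represent.

```python
bmp = "é"               # U+00E9, inside the Basic Multilingual Plane
astral = "\U00010400"   # U+10400, outside the BMP

print(bmp.encode("utf-16-be").hex())     # '00e9'     -- identical to its UCS-2 form
print(astral.encode("utf-16-be").hex())  # 'd801dc00' -- surrogate pair; no UCS-2 form exists
```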

Jerry Coffin
Ok, but why did MS perpetuate this into [.NET](http://msdn.microsoft.com/en-US/library/system.text.unicodeencoding%28v=VS.80%29.aspx)? Wasn't .NET a post-Win2k invention?
GregS
@GregS: About all I can say is that fans of .NET would undoubtedly flag my honest opinion of the design of .NET as offensive (in fact, even though I toned it down a lot, that's already happened).
Jerry Coffin
A: 

UTF-16 and UTF-8 are both encodings of Unicode. They are both Unicode; one is not more Unicode than the other.

Don't let an unfortunate historical artifact from Microsoft confuse you.

Mark Ransom