UTF-8 or UTF-16 or UTF-32 or UCS-2

+4 Q:

Hi all

I am designing a new CMS and want it to fit all my future needs, such as multilingual content, so I was thinking Unicode (UTF-8) is the best solution.

But after some searching I found this article:

http://msdn.microsoft.com/en-us/library/bb330962%28SQL.90%29.aspx#intlftrql2005_topic2

So I am now confused about what to use: UTF-8 / UTF-16 / UTF-32 / UCS-2

Which is better for multilingual content, performance, etc.?

PS: I am using ASP.NET and C# and SQL Server 2005

Thanks in advance

Quick note: basically everything can be represented in the Unicode character set. UTF-8 is just one encoding that's able to represent all of the characters in this set.

UCS-2 is not really a thing to use anymore. It can't hold characters beyond U+FFFF.

Which of the remaining three to use depends on what kind of operations you want to do on the text. UTF-8 (usually, not always!) will take up less space on disk representing the same data, and is a strict superset of ASCII, so it might reduce the amount of transcoding needed. However, you can't index your string by code point or find its length in code points in constant time.

UTF-32 does allow you to find the length of the string and index it in constant time. It isn't a superset of ASCII like UTF-8 is. It does also require you to have 4 bytes per code point, but hey, disk space is cheap.
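To make the indexing trade-off concrete, here is a minimal Python sketch. The sample text and the two helper names are assumptions made up for this illustration. It shows that a code point in UTF-32 sits at a fixed byte offset, while counting code points in UTF-8 means walking every byte.

```python
# Sample text (an assumption for illustration): "naïve 日本語 😀",
# written with escapes so the source file's own encoding does not matter.
text = "na\u00efve \u65e5\u672c\u8a9e \U0001F600"

utf8 = text.encode("utf-8")
utf32 = text.encode("utf-32-le")  # explicit byte order, so no BOM is added


def utf32_code_point_at(data: bytes, index: int) -> str:
    """Constant-time lookup: code point i occupies bytes [4*i, 4*i + 4)."""
    return data[4 * index : 4 * index + 4].decode("utf-32-le")


def utf8_code_point_count(data: bytes) -> int:
    """Linear-time count: skip continuation bytes (bit pattern 10xxxxxx)."""
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)


print("UTF-8 bytes: ", len(utf8))
print("UTF-32 bytes:", len(utf32))
print("Code points: ", len(utf32) // 4, "=", utf8_code_point_count(utf8))
print("Code point 6:", utf32_code_point_at(utf32, 6))
```

The UTF-32 form is about twice the size of the UTF-8 form for this sample, which is the price paid for the fixed offsets.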

Aaron Gallagher 2010-08-13 01:58:27

+1 A:

First of all, forget about UCS-2: it is obsolete. It contains only a subset of Unicode characters. Forget about UTF-32 too: it is very large and very redundant. It is not useful for data transmission.

In web pages, the most economical one is UTF-8 if most of the languages you handle are Western-like (Latin, Cyrillic, Greek, etc.). But if bandwidth and loading times are not an issue, you could equally well use UTF-16. Just make sure that you always know which format the data is in when you handle a byte[]. And don’t try to convert to obsolete 8-bit character sets such as ISO-8859 or Windows-1252, because you will lose data if you do.
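The data-loss warning can be seen in a small Python sketch. The sample string is an assumption chosen for illustration, and `errors="replace"` is used only so the failure shows up as `?` characters instead of an exception.

```python
# Sample text (an assumption for illustration): "Café 日本語".
text = "Caf\u00e9 \u65e5\u672c\u8a9e"

# Windows-1252 can hold "é" but none of the Japanese characters, so each
# of them is replaced by "?" and cannot be recovered afterwards.
lossy = text.encode("cp1252", errors="replace")
round_tripped = lossy.decode("cp1252")

# UTF-8 and UTF-16 round-trip the same text without loss.
utf8_ok = text.encode("utf-8").decode("utf-8") == text
utf16_ok = text.encode("utf-16").decode("utf-16") == text

print(round_tripped)
print(utf8_ok, utf16_ok)
```

With the default `errors="strict"`, the same `encode("cp1252")` call raises `UnicodeEncodeError` instead of silently dropping characters.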

In C# code, your string objects will internally be in UTF-16, and there’s nothing you can do about that. So your normal string operations (e.g. Substring()) are unaffected by your choice of output format. One could argue that this makes it more performant to encode as UTF-16, but it’s not worth it if you’re going to transmit it across the Internet, where the cost of transmitting the larger UTF-16 outweighs the tiny processing gain.

In SQL Server, you should use nvarchar(...).

Timwi 2010-08-13 02:00:31

+2 A:

UTF-8 or UTF-16 are both good choices. They both give you access to the full range of Unicode code points without using up 4 bytes for every character.

Your choice will be influenced by the language you're using and its support for these formats. I believe UTF-8 plays best with ASP.NET overall, but it will depend on what you're doing.

UTF-8 is often a good choice overall because it plays well with code that expects only ASCII, whereas UTF-16 doesn't. It is also the most efficient way of representing content largely consisting of the English alphabet, while still allowing the full repertoire of Unicode when needed. A good reason for choosing UTF-16 would be if your language/framework used it natively, or if you're going to be mainly using characters that aren't in ASCII, such as those of Asian languages.

thomasrutter 2010-08-13 02:04:05

+5 A:

This is a non-issue because you say:

I am using ASP.NET and C# and SQL Server 2005

SqlServer uses UTF-16 in some places (ntext, nvarchar, nchar) and UTF-8 in a few XML-centric places, without you doing anything weird.

C# uses UTF-16 in all its strings, with tools to encode when it comes to dealing with streams and files, which brings us on to...

ASP.NET uses UTF-8 by default, and it's hard to think of a time when it isn't a good choice. Even with Asian languages, the textual concision of such languages, combined with the fact that the names and symbols with special meaning in HTML, CSS, JavaScript, most XML applications and other streams you will be sending are from the range U+0000 to U+007F, makes the advantage of UTF-16 over UTF-8 less significant than it is for plain text in those languages.

The translation between the UTF-16 of SQL Server and C# and the UTF-8 that ASP.NET uses by default in reading and writing is done for you with default settings, but since this is the one bit you can readily change, my answer would be to use UTF-8. Really you'll be using a mixture of -8 and -16, but you won't notice most of the time (have you noticed that you've already been doing so?).

SQL Server is a bit less forgiving, if only because a lot of outdated examples have text expected for human consumption being put in varchar, text or char fields. Use these purely for codes (e.g. all ISO country codes are in the range of char(2), so nchar(2) would just waste space), and only nvarchar, ntext and nchar for things people rather than machines will read and write.

Jon Hanna 2010-08-13 02:24:38

+1 - But I would store a 2-character code as nchar(2), because that avoids all code-page conversions that would occur during all the reads and writes from and to the table. It trades time for space. In general, the rule "always Unicode all the time" has served me well.

Jeffrey L Whitledge 2010-08-13 17:18:13

Performance-wise, it depends on which operations are being done, and for some it is better in time and space. That's not why I use it though, I use it because the definition of char is closer than that of nchar to the range specified for such codes in the standard. I like the datatype that best matches the definition (which is why I'm always grumbling about the way SQLServer forces a choice between oft-wasteful ntext vs. sometimes truncating nvarchar(4000) compared to postgres where if the data has no logical end-limit you just call it text, whether it'll be 2chars or 2million and the db deals

Jon Hanna 2010-08-13 17:33:35

+1 A:

So I am now confused about what to use: UTF-8 / UTF-16 / UTF-32 / UCS-2

Which is better for multilingual content, performance, etc.?

UCS-2 is obsolete: It can no longer represent every Unicode character. UTF-8, UTF-16, and UTF-32 all can. But why have three different ways to encode the same characters?

Because in the old days, programmers made two big assumptions about strings.

1. That strings consist of 8-bit code units.
2. That 1 character = 1 code unit.

The problem for multilingual text (or even for monolingual text if that language happened to be Chinese, Japanese, or Korean) is that these two assumptions combined limit you to 256 characters. If you need to represent more than that, you need to drop one of the assumptions.

Keeping assumption #1 and dropping assumption #2 gives you a variable-width (or multi-byte) encoding. Today, the most popular variable-width encoding is UTF-8.

Dropping assumption #1 and keeping assumption #2 gives you a wide-character encoding. Unicode and UCS-2 were originally designed to use a 16-bit fixed-width encoding, which would allow for 65,536 characters. Early adopters of Unicode, such as Sun (for Java) and Microsoft (for NT) used UCS-2.

However, a few years later, it was realized that even that wasn't enough for everybody, so the Unicode code range was expanded. Now if you want a fixed-width encoding, you have to use UTF-32.

But Sun and Microsoft had written huge APIs based around 16-bit characters, and weren't enthusiastic about rewriting them for 32-bit. Fortunately, there was still a block of 2048 unassigned characters out of the original 65,536-character "Basic Multilingual Plane", which could be assigned as "surrogates" to be used in pairs to represent supplementary characters: the UTF-16 encoding form. Unfortunately, UTF-16 meets neither of the original two assumptions: It's both non-8-bit and variable-width.
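The surrogate mechanism can be sketched in a few lines of Python. The function names here are made up for this illustration. The arithmetic follows the standard UTF-16 rule: subtract 0x10000, then split the remaining 20 bits across a high surrogate (starting at U+D800) and a low surrogate (starting at U+DC00).

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need surrogates")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def utf16_code_units(text: str) -> list:
    """Return the 16-bit code units Python's own UTF-16 encoder produces."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i : i + 2], "little") for i in range(0, len(data), 2)]


high, low = to_surrogate_pair(0x1F600)     # U+1F600 GRINNING FACE
print(hex(high), hex(low))
print([hex(unit) for unit in utf16_code_units("\U0001F600")])
print(len(utf16_code_units("A")), len(utf16_code_units("\U0001F600")))
```

A BMP character takes one code unit and a supplementary character takes two, which is why UTF-16 is variable-width.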

In summary:

Use UTF-8 when the assumption of 8-bit code units is important.

This applies to:

Filenames and related OS calls on Unix systems, which had an established tradition of allowing variable-width encodings, but can't accept '\x00' bytes within strings and thus can't use UTF-16 or UTF-32. In fact, UTF-8 was originally designed for a Unix-based OS (Plan 9).
Communications protocols designed around streams of octets.
Anything that requires binary compatibility with US-ASCII but gives no special treatment to byte values above 127.
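A quick Python check illustrates the first and third points above. The filename is an arbitrary assumption for the example. Plain ASCII text encoded as UTF-16 or UTF-32 is full of zero bytes, while the UTF-8 form has none and is byte-for-byte identical to the ASCII form.

```python
# Hypothetical filename, chosen only for illustration.
name = "report.txt"

encoded = {
    "utf-8": name.encode("utf-8"),
    "utf-16": name.encode("utf-16-le"),
    "utf-32": name.encode("utf-32-le"),
}

# Which encodings put a NUL byte inside the string?
has_nul = {label: b"\x00" in data for label, data in encoded.items()}

# Is the UTF-8 form identical to the ASCII form?
ascii_compatible = encoded["utf-8"] == name.encode("ascii")

print(has_nul)
print(ascii_compatible)
```

Any C-style API that treats a zero byte as the end of the string would cut the UTF-16 form of this name off after its first character.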

Use UTF-32 when the assumption of a fixed-width encoding is important.

This is useful when you care about the properties of characters as opposed to their encoding, such as the Unicode equivalents to the ctype.h functions like isalpha, isdigit, toupper, etc.
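As a rough sketch of that use case in Python: every 4-byte unit of UTF-32 data is one whole code point, so it can be handed directly to a per-character classification function. The helper names and the sample string are assumptions for illustration, and Python's `str.isalpha` and `str.isdigit` stand in for the Unicode-aware ctype.h equivalents.

```python
import struct


def code_points_from_utf32(data: bytes) -> list:
    """Unpack little-endian UTF-32 bytes into a list of integer code points."""
    count = len(data) // 4
    return list(struct.unpack("<%dI" % count, data))


def classify(code_point: int) -> str:
    """Classify one code point using Python's Unicode-aware str methods."""
    ch = chr(code_point)
    if ch.isalpha():
        return "alpha"
    if ch.isdigit():
        return "digit"
    return "other"


# Sample (an assumption): Cyrillic "д", Arabic-Indic digit three, and "!".
data = "\u0434\u0663!".encode("utf-32-le")
for cp in code_points_from_utf32(data):
    print("U+%04X" % cp, classify(cp))
```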

Use UTF-16 when neither assumption is that important, but your platform used to use UCS-2.

Are you writing for Windows, or for the .NET framework designed for it? For Java? Then UTF-16 is your default string type; might as well use it.

Since you are using C#, all of your strings will be encoded in UTF-16. ASP.NET will encode the actual HTML pages in UTF-8, but this is done behind the scenes and you don't need to care.

Size considerations

The three UTF encoding forms require different amounts of memory to represent a character:

Characters U+0000 to U+007F (ASCII) require 1 byte in UTF-8, 2 bytes in UTF-16, or 4 bytes in UTF-32.
Characters U+0080 to U+07FF (IPA symbols, Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac, Thaana, NKo) require 2 bytes in UTF-8, 2 bytes in UTF-16, or 4 bytes in UTF-32.
Characters U+0800 to U+FFFF (the rest of the BMP, mostly for Asian languages) require 3 bytes in UTF-8, 2 bytes in UTF-16, or 4 bytes in UTF-32.
Characters U+10000 to U+10FFFF require 4 bytes in all three encoding forms.

Thus, if you want to save space, use UTF-8 if your characters are mostly ASCII, or UTF-16 if your characters are mostly Asian.
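The size ranges above can be checked with a short Python sketch. The sample characters are assumptions, one picked from each range. The `-le` codec variants are used so that no byte order mark is counted.

```python
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")

# One sample character per range (chosen for illustration).
SAMPLES = {
    "A": "U+0041, ASCII",
    "\u044f": "U+044F, Cyrillic",
    "\u65e5": "U+65E5, rest of the BMP",
    "\U0001F600": "U+1F600, supplementary plane",
}


def encoded_sizes(ch: str) -> tuple:
    """Return the byte length of one character in UTF-8, UTF-16 and UTF-32."""
    return tuple(len(ch.encode(encoding)) for encoding in ENCODINGS)


for ch, description in SAMPLES.items():
    print(description, encoded_sizes(ch))
```

Running the same function over a sample of your real content gives a better estimate than the ranges alone, since HTML markup around the text is almost entirely ASCII.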

dan04 2010-08-13 03:12:27
