views: 1143

answers: 18
I want to know why any developer would need to use an encoding other than UTF-8.

+4  A: 

Sometimes they are restricted due to historical/unsupported reasons (I'm developing on Windows using Zend Studio on a Samba share on a Linux box, and something in that mix means I keep reverting to Cp1512 instead of UTF-8).

Sometimes you don't need to use UTF-8 (for example, when storing an MD5 hash in a database: you only need the hexadecimal range 0-9 and A-F, so why make it a UTF-8 field, which will take at least a byte of extra storage compared to plain ASCII?).

Sometimes it's just laziness about learning the UTF-8 functions for a particular language.

Richy C.
Why would the UTF8 representation of hex digits occupy more storage than the ASCII representation? The byte values are the same in the two encodings.
Jonathan Leffler
UTF-8 does not take more bytes than ASCII for encoding ASCII. Why do you think it needs an extra byte?
robcast
OK, perhaps I should have qualified it a bit more. I've seen some implementations [if I recall correctly, it *might* have been Oracle] store a byte order mark (BOM) for all UTF-8 data fields; some implementations don't use it unless the data is non-ASCII, and some don't use it unless the BOM differs from the "default".
Richy C.
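To make the byte counts being debated here concrete, the following is a minimal Java sketch (Java is used purely for illustration; the charset constants come from java.nio.charset.StandardCharsets, and the BOM bytes are written out by hand):

```java
import java.nio.charset.StandardCharsets;

public class HexByteCount {
    public static void main(String[] args) {
        String hex = "d41d8cd9"; // hex digits are all plain ASCII characters
        System.out.println(hex.getBytes(StandardCharsets.US_ASCII).length); // 8
        System.out.println(hex.getBytes(StandardCharsets.UTF_8).length);    // 8 -- no extra bytes for ASCII-range text

        // If a system insists on storing a UTF-8 byte order mark, that adds three bytes (EF BB BF):
        byte[] bom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};
        System.out.println(bom.length + hex.getBytes(StandardCharsets.UTF_8).length); // 11
    }
}
```

For ASCII-range data such as hex digits, UTF-8 and ASCII are byte-for-byte identical; any size difference comes from extras such as a stored BOM.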
@Richy C: Cp1512??? Do you mean cp1251? cp1252?
John Machin
Yep John, it seems I made a typo: I meant Cp1252 instead of Cp1512. D'uh! The principle is the same though ;)
Richy C.
+3  A: 

One legitimate reason is when you need to deal with legacy documents, software or hardware that are not Unicode compatible.

Another legitimate reason is that you need to use a programming language or libraries that do not support UTF-8 / Unicode well ... or at all.

Other answers mention that UTF-16 is more compact than UTF-8 for Asian languages / characters.

And of course there are reasons like short-sightedness, ignorance, laziness ... and deadlines.

Stephen C
+1 another nice summary, and that real-world edge, ooh can't beat it.
Smandoli
+12  A: 

Wikipedia lists advantages and disadvantages of UTF-8 as compared to a variety of other encodings:

http://en.wikipedia.org/wiki/UTF-8#Advantages_and_disadvantages

The most important disadvantages, IMHO, are that UTF-8 can use significantly more space, especially for Asian languages such as Chinese, Japanese, or Hindi, and that not all code points have the same size, which makes measurement more difficult and many string operations, such as search, inefficient.

0xA3
Not all code points have the same size in UTF-16.
Craig McQueen
But there are other encodings where this is the case, such as UCS-2, ASCII, etc.
0xA3
+1  A: 

Because you sometimes want to operate easily on code points -- then you'd choose, for example, UCS-2 or UCS-4.

liori
UCS-2 is limited to the BMP. Certainly not the smartest choice nowadays.
Joey
@Joey, not a problem if you know that every character in your string is in the BMP. If you define it as UCS-2 then you know that every character is the same width (2 bytes) but if you define it as UTF-16 (even though the encoded bytes may be identical) you have to check for surrogate pairs.
finnw
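A small Java sketch of the difference finnw describes (Java strings are sequences of UTF-16 code units; the sample strings are arbitrary):

```java
public class SurrogatePairs {
    public static void main(String[] args) {
        String bmpOnly = "Ab\u00E9";          // every character is in the BMP: one UTF-16 code unit each
        String withAstral = "A\uD83D\uDE00";  // U+1F600 needs a surrogate pair in UTF-16

        System.out.println(bmpOnly.length());                                  // 3 code units = 3 code points
        System.out.println(withAstral.length());                               // 3 code units ...
        System.out.println(withAstral.codePointCount(0, withAstral.length())); // ... but only 2 code points

        // Treating UTF-16 as fixed-width breaks here: charAt(1) is only half a character.
        System.out.println(Character.isHighSurrogate(withAstral.charAt(1)));   // true
    }
}
```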
@finn: I don't consider developer laziness a valid reason to impose arbitrary restrictions on users. Unicode hasn't been a 16-bit code for quite some time now; there is no reason to perpetuate invalid assumptions. By the same argument you can probably make ISO 8859 look beneficial but it's not.
Joey
@Joey, it's not about laziness. If the source (e.g. a legacy database) is already limited to 8859-1 for example, you could convert it to UTF-8 (which you might want to do for consistency if the rest of your system uses UTF-8) but it's a trade-off because then you no longer have fixed-width characters. If you leave it as 8859-1 or convert it to UCS-2 or UTF-32 you still have a fixed-width encoding. This is *not* the case if you convert it to UTF-8, UTF-16, GB18030 etc.
finnw
(continued) This can be a pain (and require costly modifications) when using APIs that were originally designed for a fixed-width encoding (ASCII or UCS-2) and later "retrofitted" to treat the same arguments as UTF-8 or UTF-16. I have seen this in some Java projects (Java has migrated from UCS-2 to UTF-16.)
finnw
+7  A: 

Code points between U+0800 and U+FFFF take up three bytes in UTF-8 but only two in UTF-16. See the Wikipedia comparison for more details, but basically, if text heavily uses code points in this range (say, if it's Chinese), UTF-8 files will be larger than UTF-16 files with the same content.
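For instance, a minimal Java sketch (the two-character sample string is arbitrary):

```java
import java.nio.charset.StandardCharsets;

public class CjkSize {
    public static void main(String[] args) {
        String text = "\u4E2D\u6587"; // "中文" -- both code points lie between U+0800 and U+FFFF

        System.out.println(text.getBytes(StandardCharsets.UTF_8).length);    // 6: three bytes per character
        System.out.println(text.getBytes(StandardCharsets.UTF_16LE).length); // 4: two bytes per character
        // (StandardCharsets.UTF_16 would report 6 here only because it prepends a two-byte byte order mark.)
    }
}
```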

Welbog
+3  A: 

It's also worth remembering that in some circumstances (where a non-Latin set of characters is needed) UTF-8 can actually end up larger than the 16-bit Unicode encodings. In those cases, UCS-2 or UTF-16 would be a better choice.

AnthonyWJones
Besides, you should never use UCS-2 if you can avoid it, because it can only encode part of Unicode (plane 0, the BMP, the U+0000 to U+FFFF range), and that may break your program in interesting ways.
robcast
+1  A: 

Unicode is certainly a good place to work from in most cases, but a developer should be familiar with many different types of character encoding. For example, ASCII might be used if the set of characters is limited.

What if you're a developer receiving data from a source that doesn't send UTF-8? There can be lots of interface issues if you don't understand your input.
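As a hedged illustration of that kind of interface issue, here is a small Java sketch (the sample bytes are made up; ISO-8859-1 merely stands in for "some non-UTF-8 source"):

```java
import java.nio.charset.StandardCharsets;

public class WrongCharset {
    public static void main(String[] args) {
        // Bytes as a legacy source might send them: "café" encoded in ISO-8859-1.
        byte[] input = "caf\u00E9".getBytes(StandardCharsets.ISO_8859_1);

        // Assuming UTF-8 mangles it: a lone 0xE9 byte is not a valid UTF-8 sequence,
        // so the decoder substitutes the replacement character U+FFFD.
        System.out.println(new String(input, StandardCharsets.UTF_8));      // "caf" + U+FFFD

        // Knowing the real encoding of the input gets the text back intact.
        System.out.println(new String(input, StandardCharsets.ISO_8859_1)); // café
    }
}
```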

Joel's article on the must-knows for character encoding is good and worth reading.

Chet
Thanks. Fixed.
Chet
+6  A: 

UTF-8 is very efficient at encoding plain English text (same as ASCII). If your user base is likely to be mostly, say, Chinese, you will be much better off using UTF-16.

For more information, see The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets.

Mac
+1 for (a) brief clear answer and (b) pointer to Joel
Smandoli
I found the linked blog entry from Joel very informative. Thanks!
T Pops
+1  A: 

Many APIs require other Unicode encodings - mostly UTF-16. For instance, Java, .NET, Win32.

Nemanja Trifunovic
.NET uses UTF-8 as the default encoding.
awe
@awe: not sure what you mean by "default encoding", but I can assure you that the .NET String class internally stores text as UTF-16.
Nemanja Trifunovic
OK - you are right that [String](http://msdn.microsoft.com/en-us/library/system.string.aspx#Characters) is internally UTF-16. What I based this on is that reading from a file defaults to UTF-8 encoding (see [StreamReader](http://msdn.microsoft.com/en-us/library/system.io.streamreader.aspx)).
awe
These differences are explained in a bit more detail in [this](http://stackoverflow.com/questions/1200063/why-does-anyone-use-an-encoding-other-than-utf-8/2470079#2470079) answer by **Joseph Boyle**.
awe
+4  A: 

Well, some do it because their tools are archaic or flawed. Some do it because they don't see a need to support anything other than ASCII. Some do it because they don't know any better.

Those are the usual excuses for not using Unicode.

As for not using UTF-8 specifically, there are different reasons. Some systems, like Windows¹ (and, stemming from that, .NET) and Java, came to be at a time when Unicode was strictly a 16-bit code. Therefore, there was really only one encoding: UCS-2, which encodes code points directly as 16-bit words.

Later, Unicode was expanded to 21 bits because 65,536 code points weren't enough anymore. This caused encodings such as UTF-32 and UTF-16 to appear. For systems previously working with UCS-2, the transition to UTF-16 was the easiest and most sensible choice. Windows made that transition back in Ye Olde Days of Windows 2000.

So while I think that nearly all applications nowadays should support Unicode, I don't think it is strictly necessary for them to use UTF-8 specifically. There are historical reasons for that, and no real benefit in converting existing systems from UTF-16 to UTF-8.


¹ NT.

Joey
+1 for more than I wanted to know, but very well summarized
Smandoli
.... uh, and of course I didn't ask the question, so of course it's more than I wanted to know ...
Smandoli
@Smandoli: ...but since you read this post you are interested in the subject, so any answer has the possibility of answering something that you did want to know.
awe
+1  A: 

Mostly for historical reasons.

yairchu
A: 

At my previous employer, we used ISO-8859-1 for some of our ASP pages to match the collation of our SQL Server, which, as you can guess, was not Unicode. I wanted to change the collation, but the manager said to wait until we upgraded our SQL Server to do it. Needless to say, it never happened - I haven't been with them for a little over a year now, so I don't know if they finally did it.

Waleed Al-Balooshi
A: 

In Western Europe, the ISO-8859-1 (a.k.a. "Latin1") encoding is quite common. It was used in the DOS and early Windows days, and in many places (databases, service calls) you will still find it at times. So when interfacing with such a legacy system, you're likely to encounter that encoding.

Not that I would recommend its use - UTF-8 is just so much easier to use and causes so much less trouble and friction.
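If you do have to interface with such a system, the usual recipe is to decode with the legacy encoding and then re-encode as UTF-8; a minimal Java sketch (the sample data and class name are hypothetical):

```java
import java.nio.charset.StandardCharsets;

public class Latin1ToUtf8 {
    public static void main(String[] args) {
        // Bytes from a hypothetical Latin1 legacy database: "München".
        byte[] latin1Bytes = "M\u00FCnchen".getBytes(StandardCharsets.ISO_8859_1);

        // Decode with the encoding the legacy system actually used ...
        String text = new String(latin1Bytes, StandardCharsets.ISO_8859_1);

        // ... then re-encode as UTF-8 for the rest of the system.
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);

        System.out.println(latin1Bytes.length); // 7
        System.out.println(utf8Bytes.length);   // 8: the 'ü' now takes two bytes
    }
}
```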

Marc

marc_s
+2  A: 
jbcreix
+3  A: 

Because outside the English-speaking world, people have been using various encodings that predate Unicode and are tailored for their respective languages for decades. These language-specific encodings have become ingrained everywhere and are pretty much a standard. If you want to have any hope of interfacing with legacy systems, you have to use them, so all systems have to support them and usually use them as the default, even if they now support UTF-8 as well. There may even be multiple legacy encodings traditionally used for different purposes.

Examples:

The last two examples show that encodings can even be a political issue.

Michael Borgwardt
+2  A: 

http://www.personal.psu.edu/ejp10/blogs/gotunicode/2007/02/cjk-unicode-angst-in-japan-and.html has a good summary + links about the difficulty Japanese users have with Unicode.

http://www.jbrowse.com/text/unij.html

http://www.hastingsresearch.com/net/04-unicode-limitations.shtml

http://www.mojikyo.org/html/abroad/abroad%5Ftop.html

Apparently Unicode is moving away from unification due to such complaints.

wrang-wrang
+1  A: 

The reasons for using non-Unicode 8-bit character sets / encodings are all backward compatibility of some kind, and/or inertia. For that matter, the most frequent reason for using UTF-8 is compatibility with standards like XML that mandate or prefer UTF-8.

Differences in the number of bytes you think text will take up in different encodings, especially in storage, are mostly theoretical. In real world situations, compatibility requirements are more important. If compression is used, the size differences go away anyway. Even if compression is not used, total text size is hard to predict and is rarely a deciding factor.

When converting legacy code that used non-Unicode 8-bit encodings, using UTF-16 can be a tool for making sure all code has been converted, because mismatches can be caught as compile-time type errors. Many languages, runtimes, and libraries, such as JavaScript, the JVM, .NET, and ICU, use 16-bit strings and UTF-16, even though storage and Internet protocols are usually 8-bit.
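One way to read the compile-time point, sketched in Java (where String is UTF-16 internally and byte[] is a distinct type; the normalize method is a hypothetical placeholder):

```java
import java.nio.charset.StandardCharsets;

public class TypedText {
    // After the conversion, text-handling code accepts only String (UTF-16 in Java);
    // any call site still passing raw byte[] no longer compiles until it is fixed.
    static String normalize(String name) {
        return name.trim();
    }

    public static void main(String[] args) {
        byte[] legacyBytes = "  Alice ".getBytes(StandardCharsets.ISO_8859_1);

        // normalize(legacyBytes);   // compile-time error: byte[] is not a String

        // The fix is an explicit, auditable conversion that names the legacy encoding.
        String converted = new String(legacyBytes, StandardCharsets.ISO_8859_1);
        System.out.println("[" + normalize(converted) + "]"); // [Alice]
    }
}
```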

Joseph Boyle
+1  A: 

Related to the subject: when using MySQL, as if it wasn't complex enough, you get the option to choose which kind of UTF-8 collation you want to use. So what would you use, utf8_general_ci or utf8_unicode_ci? (I tend to use the UTF-8 variant that is used for the database connection.)

Jorix