How wide-spread is the use of UTF-8 for non-English text, on the WWW or otherwise? I'm interested both in statistical data and the situation in specific countries.

I know that ISO-8859-1 (or 15) is firmly entrenched in Germany - but what about languages where you have to use multibyte encodings anyway, like Japan or China? I know that a few years ago, Japan was still using the various JIS encodings almost exclusively.

Given these observations, would it even be true that UTF-8 is the most common multibyte encoding? Or would it be more correct to say that it's basically only used internally in new applications that specifically target an international market and/or have to work with multi-language texts? Is it acceptable nowadays to have an app that ONLY uses UTF-8 in its output, or would each national market expect output files to be in a different legacy encoding in order to be usable by other apps?

Edit: I am NOT asking whether or why UTF-8 is useful or how it works. I know all that. I am asking whether it is actually being adopted widely and replacing older encodings.

+1  A: 

Both Java and C# use UTF-16 internally and can easily translate to other encodings; they're pretty well entrenched in the enterprise world.

I'd say accepting only UTF-8 as input is not that big a deal these days; go for it.
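To make the boundary conversion concrete, here is a minimal Java sketch (the file name and text are just illustrative) that names the charset explicitly when writing and reading, instead of relying on the JVM's default:

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.OutputStreamWriter;
    import java.io.Writer;

    public class Utf8IoSketch {
        public static void main(String[] args) throws IOException {
            String text = "Grüße aus München"; // held as UTF-16 in memory

            // Write with an explicit charset instead of the JVM default.
            Writer out = new OutputStreamWriter(new FileOutputStream("greeting.txt"), "UTF-8");
            out.write(text);
            out.close();

            // Read it back, again naming the charset explicitly.
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(new FileInputStream("greeting.txt"), "UTF-8"));
            System.out.println(text.equals(in.readLine())); // true
            in.close();
        }
    }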

Randolpho
I thought Java only used UTF-16 internally, and defaulted to the JVM's default charset upon encoding a file? Or has that changed recently? Nevertheless, I've never seen UTF-16 used as a file format myself (for obvious reasons), does anyone do that? Or did you mean UCS-2?
Pieter
You're right, I should rephrase.
Randolpho
+14  A: 

We use UTF-8 in our service-oriented web-service world almost exclusively - even with "just" Western European languages, there are enough "quirks" to using the various ISO-8859-X formats to make our heads spin - UTF-8 really just totally solves that.

So I'd put in a BIG vote for using UTF-8 everywhere and all the time! :-) I guess in a service-oriented world and in .NET and Java environments, that's really not an issue or a potential problem anymore.

It just solves so many problems that you'd otherwise have to deal with all the time...

Marc

marc_s
Yes, I know it makes life so much easier - the question is whether you can actually get away with it everywhere, or whether you'll be forced to deal with other encodings constantly whenever you leave your own app's ecosystem. I suppose it's relatively easy to get away with when you define web services; I was more thinking about documents that are handled by end users.
Michael Borgwardt
Yes, for the most part - in the service world, UTF-8 (or -16) really is the de facto standard, and hardly anyone is crazy enough to deviate from it :-)
marc_s
The reason is probably that web services are relatively new and not burdened by requirements of backwards compatibility.
Michael Borgwardt
+4  A: 

I don't think it's acceptable to accept only UTF-8 - you need to accept both UTF-8 and whatever encoding was previously prevalent in your target markets.

The good news is that if you're coming from a German situation, where you mostly have 8859-1/15 and ASCII, additionally accepting 8859-1 and converting it into UTF-8 is basically zero-cost. It's easy to detect: an 8859-1-encoded ö or ü, for example, is invalid UTF-8, even before you get into the easily detectable invalid byte sequences. Bytes 128-159 are unlikely to appear in valid 8859-1 text. Within a few bytes of the first high byte, you can generally have a very, very good idea of which encoding is in use. And once you know the encoding, whether by specification or by guessing, you don't need a translation table to convert 8859-1 to Unicode: U+0080 through U+00FF are exactly the same as bytes 0x80-0xFF in 8859-1.
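As an illustration of that fallback, here is a minimal Java sketch (class and method names are just for the example): decode strictly as UTF-8 first, and drop back to 8859-1 if the bytes aren't valid UTF-8.

    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.StandardCharsets;

    public class GuessLatin1OrUtf8 {
        // Decode as UTF-8 if the bytes are valid UTF-8, otherwise fall back to ISO-8859-1.
        static String decode(byte[] bytes) {
            CharsetDecoder utf8 = StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT);
            try {
                return utf8.decode(ByteBuffer.wrap(bytes)).toString();
            } catch (CharacterCodingException e) {
                // Every byte value is defined in 8859-1; 0x80-0xFF map straight to U+0080-U+00FF.
                return new String(bytes, StandardCharsets.ISO_8859_1);
            }
        }

        public static void main(String[] args) {
            byte[] latin1 = { 'M', (byte) 0xFC, 'n', 'c', 'h', 'e', 'n' }; // "München" in 8859-1
            System.out.println(decode(latin1)); // prints München
        }
    }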

Jon Bright
And of course, to determine the encoding more exhaustively, there is chardet. http://stackoverflow.com/questions/373081
ShreevatsaR
+1  A: 

I'm interested both in statistical data and the situation in specific countries.

I think this is much more dependent on the problem domain and its history than on the country in which an application is used.

If you're building an application for which all your competitors are outputting in e.g. ISO-8859-1 (or have been for the majority of the last 10 years), I think all your (potential) clients would expect you to open such files without much hassle.

That said, I don't think there's still a need, most of the time, to output anything but UTF-8 encoded files. Most programs cope these days, but once again, YMMV depending on your target market.

Pieter
+2  A: 

UTF-8 is popular because it is usually more compact than UTF-16, with full fidelity. It also doesn't suffer from the endianness issue of UTF-16.

This makes it a great choice as an interchange format, but because characters encode to byte runs of varying length (from one to four bytes per character), it isn't always very nice to work with. So it is usually cleaner to reserve UTF-8 for data interchange, and use conversion at the points of entry and exit.
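For instance, this small Java sketch shows the one-to-four-byte runs (the sample characters are arbitrary):

    import java.nio.charset.StandardCharsets;

    public class Utf8Lengths {
        public static void main(String[] args) {
            // One logical character each; the last one is a surrogate pair in UTF-16.
            String[] samples = { "A", "é", "中", "\uD83D\uDE00" }; // U+0041, U+00E9, U+4E2D, U+1F600
            for (String s : samples) {
                System.out.printf("%s -> %d UTF-8 byte(s)%n",
                        s, s.getBytes(StandardCharsets.UTF_8).length);
            }
            // Prints 1, 2, 3 and 4 bytes respectively.
        }
    }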

For system-internal storage (including disk files and databases) it is probably cleaner to use a native UTF-16, UTF-16 with some other compression, or some 8-bit "ANSI" encoding. The latter of course limits you to a particular codepage and you can suffer if you're handling multi-lingual text. For processing the data locally you'll probably want some "ANSI" encoding or native UTF-16. Character handling becomes a much simpler problem that way.

So I'd suggest that UTF-8 is popular externally, but rarer internally. Internally UTF-8 seems like a nightmare to work with aside from static text blobs.

Some DBMSs seem to choose to store text blobs as UTF-8 all the time. This offers the advantage of compression (over storing UTF-16) without trying to devise another compression scheme. Because conversion to/from UTF-8 is so common they probably make use of system libraries that are known to work efficiently and reliably.

The biggest problems with "ANSI" schemes are being bound to a single small character set and needing to handle multibyte character set sequences for languages with large alphabets.

Bob Riemersma
UTF-8 might be rare as an internal encoding on Windows, but it's by far the most common encoding on Unix systems, and on applications that originate on Unix platforms.
BlackAura
I was wrong above. UTF-8 encodes to as many as 6 bytes per character, not 4. I still suspect a lot of Unix software cannot handle UTF-8 properly and simply uses US ASCII or ISO 8859-1 and "calls it" UTF-8, but being an expert on neither Unix nor Unicode I won't argue the point.
Bob Riemersma
You were NOT wrong. Unicode UTF-8 goes up to only 4 bytes. The ISO version goes up to 6 but nobody is ever going to define that many characters.
John Machin
+4  A: 

Is it acceptable nowadays to have an app that ONLY uses UTF-8 in its output, or would each national market expect output files to be in a different legacy encoding in order to be usable by other apps?

Hmm, depends on what kind of apps and output we're talking about... In many cases (e.g. most web-based stuff) you can certainly go with UTF-8 only, but, for example, in a desktop application that lets the user save some data in plain text files, I think UTF-8 only is not enough.

Mac OS X uses UTF-8 extensively, and it's the default encoding for users' files, and this is the case in most (all?) major Linux distributions too. But on Windows... is Windows-1252 (close to but not the same as ISO-8859-1) still the default encoding for many languages? At least in Windows XP it was, but I'm not sure whether this has changed. In any case, as long as a significant number of (mostly Windows) users have files on their computers encoded in Windows-1252 (or something close to it), supporting UTF-8 only would cause grief and confusion for many.
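To see where the two differ, here is a tiny Java sketch (assuming the windows-1252 charset is available, as it is in mainstream JREs): byte 0x80 is the euro sign in Windows-1252 but a C1 control character in ISO-8859-1.

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class Cp1252VsLatin1 {
        public static void main(String[] args) {
            byte[] b = { (byte) 0x80 };

            // The two encodings agree on 0xA0-0xFF but differ in the 0x80-0x9F range.
            String cp1252 = new String(b, Charset.forName("windows-1252"));
            String latin1 = new String(b, StandardCharsets.ISO_8859_1);

            System.out.println(cp1252);                 // "€" (U+20AC)
            System.out.println((int) latin1.charAt(0)); // 128, i.e. the C1 control U+0080
        }
    }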

Some country-specific info: in Finland, ISO-8859-1 (or 15) is likewise still firmly entrenched. As an example, Finnish IRC channels still mostly use Latin-1, afaik. (Which means Linux guys with UTF-8 as the system default who use text-based clients (e.g. irssi) need to do some workarounds / tweak settings.)

Jonik
+2  A: 

While it does not specifically address the question: UTF-8 is the only character encoding that is mandatory to implement in all IETF standards-track protocols.

http://www.ietf.org/rfc/rfc2277.txt

Einstein
+2  A: 

You might be interested in this question. I've been trying to build a CW about the support for Unicode in various languages.

docgnome
+2  A: 

Users of CJK characters are naturally biased against UTF-8 because their characters become 3 bytes each instead of two. Evidently, in China the preference is for their own 2-byte GBK encoding, not UTF-16.

Edit in response to this comment by @Joshua :

And it turns out for most web work the pages would be smaller in UTF-8 anyway as the HTML and javascript characters now encode to one byte.

Response:

The GB.+ encodings and other East Asian encodings are variable-length encodings. Bytes with values up to 0x7F are mapped mostly to ASCII (with occasional minor variations). Some bytes with the high bit set are lead bytes of sequences of 2 to 4 bytes, and others are illegal. Just like UTF-8.

As "HTML and javascript characters" are also ASCII characters, they have ALWAYS been 1 byte, both in those encodings and in UTF-8.

John Machin
GB18030 is now the standard in China, if memory serves.
JUST MY correct OPINION
@JUSTetc: GB18030 was the standard when I wrote that. Not all websites have upgraded. In any case GB18030 is a superset of gbk which is a superset of gb2312 ... the point is that in all 3 encodings, the most common Chinese characters take up only 2 bytes instead of the 3 of UTF-8.
John Machin
And it turns out for most web work the pages would be smaller in UTF-8 anyway as the HTML and javascript characters now encode to one byte.
Joshua
@Joshua: Not so. See my edited answer.
John Machin
+5  A: 

UTF-8 is used on 55% of websites.

dan04
+3  A: 

Here are some statistics I was able to find:

  • This page shows usage statistics for character encodings in "top websites".
  • This page is another example.

Both of these pages seem to suffer from significant problems:

  • It is not clear how representative their sample sets are, particularly for non-English-speaking countries.
  • It is not clear what methodologies were used to gather the statistics. Are they counting pages, or page accesses? What about downloadable / downloaded content?

More importantly, the statistics are only for web-accessible content. Broader statistics (e.g. for the encoding of documents on users' hard drives) do not seem to be obtainable. (This does not surprise me, given how difficult / costly it would be to do the necessary studies across many countries.)

In short, your question is not objectively answerable. You might be able to find studies somewhere about how "acceptable" a UTF-8 only application might be in specific countries, but I was not able to find any.

For me, the takeaway is that it is a good idea to write your applications to be character-encoding agnostic, and let the user decide which character encoding to use for storing documents. This is relatively easy to do in modern languages like Java and C#.
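For example, in Java the encoding can simply be a parameter taken from a user setting; a minimal sketch (the helper and file names are hypothetical):

    import java.io.IOException;
    import java.nio.charset.Charset;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class SaveWithUserCharset {
        // Hypothetical helper: the charset name would come from a user preference
        // or a "Save As..." dialog rather than being hard-coded.
        static void save(Path file, String document, String charsetName) throws IOException {
            Charset cs = Charset.forName(charsetName); // throws if the name is unknown
            Files.write(file, document.getBytes(cs));
        }

        public static void main(String[] args) throws IOException {
            save(Paths.get("out-utf8.txt"), "Grüße", "UTF-8");
            save(Paths.get("out-latin1.txt"), "Grüße", "ISO-8859-1");
        }
    }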

Stephen C
+3  A: 

I tend to visit Runet websites quite often. Many of them still use the Windows-1251 encoding. It's also the default encoding in Yandex Mail and Mail.ru (the two largest webmail services in CIS countries), and it's set as the default content encoding in the Opera browser (2nd after Firefox in popularity in the region) when you download it from a Russian IP address. I'm not quite sure about other browsers, though.

The reason for that is quite simple: UTF-8 requires two bytes to encode Cyrillic letters, while the non-Unicode encodings need only one (unlike most Eastern alphabets, the Cyrillic alphabet is quite small). They are also fixed-length and easy to process with old ASCII-only tools.
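A minimal Java sketch of the size difference (assuming the windows-1251 charset is present in the JRE, as it is in mainstream JDKs):

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class CyrillicByteCounts {
        public static void main(String[] args) {
            Charset cp1251 = Charset.forName("windows-1251"); // depends on the JRE's charsets

            String word = "привет"; // six Cyrillic letters

            System.out.println(word.getBytes(cp1251).length);                 // 6  (1 byte per letter)
            System.out.println(word.getBytes(StandardCharsets.UTF_8).length); // 12 (2 bytes per letter)
        }
    }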

Andrew
+2  A: 

I'm interested both in statistical data and the situation in specific countries.

On W3Techs we have all this data, but it's perhaps not easy to find:

For example, you get the character encoding distribution of Japanese websites by first selecting the language (Content Languages > Japanese) and then selecting Segmentation > Character Encodings. That brings you to this report: Distribution of character encodings among websites that use Japanese. You can see that Japanese sites use 49% Shift JIS and 38% UTF-8. You can do the same per top-level domain, say all .jp sites.

Sam