Is UTF-8 acceptable for reading/writing Asian languages?

+6 Q:

Is UTF-8 acceptable for reading/writing Asian languages?

I am accepting user input via a web form (as UTF-8), saving it to a MySQL DB (using UTF-8 character set) and generating a text file later (encoded as UTF-8). I am wondering if there is any chance of text corruption using UTF-8 instead of something like UCS-2? Is UTF-8 good enough in this situation?

+13 A:

More than that, it is perhaps the only encoding you should ever consider using.

Some great reading on the subject:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky

karim79 2009-08-11 17:46:15

Thanks for the link - I read that a while ago. I'm familiar with the different encodings (fixed length chars vs variable length chars) but for some reason I was under the impression that UCS-2 could represent more characters. I guess I was wrong. :)

Jon Tackabury 2009-08-11 17:54:28

UCS-2 and UTF-16 are often confused. For code points in the Basic Multilingual Plane they are equivalent, but for everything else UTF-16 uses surrogate pairs, because not all Unicode characters fit in 16 bits. Windows and Java, incidentally, actually use UTF-16, _not_ UCS-2.
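
As a rough illustration of the surrogate mechanism described above, this Python sketch encodes one BMP character and one supplementary-plane character as UTF-16 and lists the 16-bit code units. The sample characters and the helper name `utf16_units` are chosen here for illustration only.

```python
import struct

def utf16_units(ch: str) -> list[int]:
    """Return the 16-bit code units UTF-16 uses for a single character."""
    data = ch.encode("utf-16-be")  # big-endian, no byte order mark
    return list(struct.unpack(f">{len(data) // 2}H", data))

bmp_char = "\u4e2d"         # U+4E2D, inside the Basic Multilingual Plane
astral_char = "\U00020000"  # U+20000, a CJK ideograph outside the BMP

# One code unit for the BMP character, same value UCS-2 would use.
print([hex(u) for u in utf16_units(bmp_char)])
# Two code units (a surrogate pair) for the character beyond U+FFFF.
print([hex(u) for u in utf16_units(astral_char)])
```

The second character has no UCS-2 representation at all, which is why the two encodings only agree within the BMP.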

bdonlan 2009-08-11 17:56:58

Note that UCS-2 has fixed-length characters, while UTF-16 has variable-length characters. Both work in 16-bit chunks. (Also note that UCS-2 is obsolete.)

John Calsbeek 2009-08-11 18:00:12

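It is absolutely appropriate for storing them. However, if you are dealing with CJK, you might also want to save the language of each string you are trying to preserve, since Han unification means the same code point can call for different glyphs in Chinese, Japanese, and Korean.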

Julik 2009-08-13 22:01:10

+2 A:

UTF-8 can represent any Unicode character. As such, you should have no problem with UTF-8.

In fact, UTF-8 can even represent some characters that UCS-2 cannot (UCS-2 can only represent U+0000 through U+FFFF; UTF-8, UTF-16, and UCS-4 handle all Unicode code points).
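
A minimal Python sketch of that point, using sample strings picked purely for illustration: each string survives a UTF-8 encode/decode round trip unchanged, including one with a code point above U+FFFF that UCS-2 could not hold.

```python
samples = [
    "日本語のテキスト",  # Japanese
    "한국어 텍스트",     # Korean
    "中文文本",          # Chinese
    "\U00020000",        # U+20000, outside the BMP
]

for text in samples:
    encoded = text.encode("utf-8")
    decoded = encoded.decode("utf-8")
    # The round trip must give back exactly the original string.
    assert decoded == text
    # UCS-2 is limited to code points U+0000 through U+FFFF.
    fits_ucs2 = all(ord(c) <= 0xFFFF for c in text)
    print(f"{len(text)} chars, {len(encoded)} bytes, fits in UCS-2: {fits_ucs2}")
```

Corruption in practice tends to come from a mismatch somewhere in the chain (form, database connection, file writer) rather than from UTF-8 itself, so it helps to state the encoding explicitly at each step.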

bdonlan 2009-08-11 17:46:42

+1 A:

As far as I know, UTF-8 can encode everything that earlier encodings such as UCS-2 can, so yes, it should be fine to use it instead of UCS-2. See http://www.unicode.org/versions/Unicode5.1.0/ and look down the sidebar for the 5.0 book chapters; parts 9-12 should be what you're after.

Nathan Kleyn 2009-08-11 17:48:20

+10 A:

If you are working with a great deal of Asian text (more so than Latin text), you may want to consider UTF-16. UTF-8 can accurately represent the entire Unicode range of characters, but it is optimized for text that is mostly ASCII: ASCII takes one byte per character, while most CJK characters take three. UTF-16 is space-efficient over the entire Basic Multilingual Plane, using two bytes for every character in it.

But UTF-8 is most certainly "good enough": there will not be corruption arising simply because you are using UTF-8 over, say, UTF-16.
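
To make the size trade-off concrete, here is a small Python sketch that compares encoded lengths. The helper `encoded_sizes` and the two sample strings are assumptions made for this example, not part of the original setup.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Return the number of bytes the text occupies in UTF-8 and UTF-16."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # "utf-16-le" is used so the byte order mark is not counted.
        "utf-16": len(text.encode("utf-16-le")),
    }

ascii_text = "hello world"   # 11 ASCII characters
cjk_text = "こんにちは世界"  # 7 BMP characters (hiragana and kanji)

print(encoded_sizes(ascii_text))  # UTF-8 is half the size of UTF-16 here
print(encoded_sizes(cjk_text))    # UTF-16 is two thirds the size of UTF-8 here
```

For text that mixes markup, digits, or Latin words with CJK, the gap narrows, so measuring real data is more reliable than assuming either encoding wins.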

John Calsbeek 2009-08-11 17:52:03

It works wonderfully with Devanagari.

Cyril Gupta 2009-08-11 18:23:23
