I am accepting user input via a web form (as UTF-8), saving it to a MySQL DB (using UTF-8 character set) and generating a text file later (encoded as UTF-8). I am wondering if there is any chance of text corruption using UTF-8 instead of something like UCS-2? Is UTF-8 good enough in this situation?

+13  A: 

More than that, it is perhaps the only encoding you should ever consider using.

Some great reading on the subject:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky

karim79
Thanks for the link - I read that a while ago. I'm familiar with the different encodings (fixed length chars vs variable length chars) but for some reason I was under the impression that UCS-2 could represent more characters. I guess I was wrong. :)
Jon Tackabury
UCS-2 and UTF-16 are often confused. For code points in the Basic Multilingual Plane (U+0000 through U+FFFF) they're equivalent, but for everything above that UTF-16 uses surrogate pairs, because not all Unicode characters fit in 16 bits. Windows and Java, incidentally, actually use UTF-16, _not_ UCS-2.
bdonlan
Note that UCS-2 has fixed-length characters, while UTF-16 has variable-length characters. Both work in 16-bit chunks. (Also note that UCS-2 is obsolete.)
John Calsbeek
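It is absolutely appropriate for storing them; however, if you are dealing with CJK, you might also want to save the language of the string you are trying to preserve (because of Han unification, the same code point can call for a different glyph in Chinese, Japanese, and Korean).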
Julik
+2  A: 

UTF-8 can represent any Unicode character. As such, you should have no problem with UTF-8.

In fact, UTF-8 can even represent some characters that UCS-2 cannot (UCS-2 can only represent U+0000 through U+FFFF; UTF-8, UTF-16, and UCS-4 handle all Unicode code points).
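The range difference described above can be checked with a short Python sketch. The two sample characters and the helper names `utf16_code_units` and `fits_in_ucs2` are illustrative assumptions, not anything from the original post.

```python
# Illustrative samples: one character inside the BMP, one outside it.
bmp_char = "\u0915"         # DEVANAGARI LETTER KA, U+0915
astral_char = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, U+1D11E


def utf16_code_units(text):
    """Number of 16-bit code units needed to encode text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


def fits_in_ucs2(text):
    """True if every code point lies in U+0000..U+FFFF, the UCS-2 range."""
    return all(ord(ch) <= 0xFFFF for ch in text)


for ch in (bmp_char, astral_char):
    encoded = ch.encode("utf-8")
    # Encoding to UTF-8 and decoding again must give back the same string.
    assert encoded.decode("utf-8") == ch
    print(
        f"U+{ord(ch):04X}: {len(encoded)} UTF-8 bytes, "
        f"{utf16_code_units(ch)} UTF-16 code unit(s), "
        f"fits in UCS-2: {fits_in_ucs2(ch)}"
    )
```

U+0915 takes three bytes in UTF-8 and a single UTF-16 code unit, while U+1D11E takes four bytes in UTF-8 and a surrogate pair in UTF-16, and has no UCS-2 representation at all. One thing worth checking on the database side: MySQL's character set named `utf8` stores at most three bytes per character, so four-byte characters like this one need `utf8mb4`.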

bdonlan
+1  A: 

As far as I know, UTF-8 can encode every Unicode code point, including everything UCS-2 covers, so yes, it should be fine to use it over UCS-2. See http://www.unicode.org/versions/Unicode5.1.0/ and look down the sidebar for the 5.0 book chapters; parts 9-12 should be what you're after.

Nathan Kleyn
+10  A: 

If you are working with a great deal of Asian text (more so than Latin text), you may want to consider UTF-16. UTF-8 can accurately represent the entire Unicode range of characters, but it is optimized for text that is mostly ASCII: ASCII characters take one byte each, while most CJK characters take three. UTF-16 uses two bytes for every character in the Basic Multilingual Plane, so it is more compact for that kind of text.

But UTF-8 is most certainly "good enough": there will not be corruption arising simply because you are using UTF-8 over, say, UTF-16.
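To put rough numbers on the size trade-off, here is a small Python sketch comparing encoded sizes. The sample strings and the `encoded_sizes` helper are made up for illustration; real data will vary with how much ASCII (markup, digits, spaces) is mixed in.

```python
# Hypothetical sample strings, chosen only to illustrate the size difference.
samples = {
    "English": "Hello, world",
    "Japanese": "こんにちは世界",
}


def encoded_sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count without a BOM)."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


for language, text in samples.items():
    utf8_size, utf16_size = encoded_sizes(text)
    print(
        f"{language}: {len(text)} code points, "
        f"UTF-8 = {utf8_size} bytes, UTF-16 = {utf16_size} bytes"
    )
```

For these samples the English string is 12 bytes in UTF-8 versus 24 in UTF-16, and the Japanese string is 21 bytes in UTF-8 versus 14 in UTF-16. Either way, both encodings round-trip the text without loss; the difference is only in storage size.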

John Calsbeek
A: 

It works wonderfully with Devanagari.

Cyril Gupta