What is better for PHP developers - Unicode or UTF-8?

I am going to create an international CMS. So I am going to have clients all over the world. They will speak all possible languages.

What encoding format is better for browser recognition and for DB data storage?

+8  A: 

"Unicode" is not an encoding. You may mean UTF-8 versus UTF-16 (big-endian or little-endian). It really doesn't matter much for browser support. Any modern browser will support all three. You will probably find UTF-8 is the most space-efficient for your database.

Matthew Flaschen
+3  A: 

UTF-8 is a Unicode encoding. You probably meant that you want to choose between UTF-8 and UTF-16.

Microsoft recommends:

"Developers should use UTF-8 for all Unicode data that they send to and receive from the browser."

For database storage, use the encoding your RDBMS has better support for. Or, all else being equal, choose based on space efficiency. UTF-8 is smaller for English and most European languages, while UTF-16 tends to be smaller for Asian languages.

dan04
+5  A: 

UTF-8 is an encoding of Unicode, a way of representing an (abstract) sequence of Unicode characters as a (concrete) sequence of bytes. There are other encodings, such as UTF-16 (which has both big-endian and little-endian variants). Both UTF-8 and UTF-16 can represent any character in Unicode, so you can support all languages regardless of which one you choose.

UTF-8 is useful if most of your text is in Western languages, since it represents ASCII characters in just one byte, but it needs three bytes each for many characters in other scripts, such as Chinese. UTF-16, on the other hand, uses exactly two bytes for most characters you're likely to encounter (though characters outside Unicode's "Basic Multilingual Plane", which include some very esoteric characters as well as emoji, require four).
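As a rough illustration of those size differences, here is a minimal sketch that measures how many bytes a few sample characters take in each encoding. It assumes PHP 7 or later (for the `"\u{...}"` escapes) with the mbstring extension loaded; the helper name `encoded_size` and the sample characters are made up for this example.

```php
<?php
// Sketch only: assumes PHP 7+ and the mbstring extension.

// Hypothetical helper: byte length of a UTF-8 string once converted
// to the given encoding. strlen() counts bytes, which is what we want.
function encoded_size(string $utf8Text, string $encoding): int
{
    return strlen(mb_convert_encoding($utf8Text, $encoding, 'UTF-8'));
}

$samples = [
    'U+0061'  => 'a',           // ASCII letter
    'U+00E9'  => "\u{00E9}",    // e with acute accent
    'U+5927'  => "\u{5927}",    // the Chinese character used in this answer
    'U+1F600' => "\u{1F600}",   // an emoji, outside the Basic Multilingual Plane
];

foreach ($samples as $label => $char) {
    printf(
        "%-8s UTF-8: %d byte(s), UTF-16: %d byte(s)\n",
        $label,
        encoded_size($char, 'UTF-8'),
        encoded_size($char, 'UTF-16BE') // BE variant, so no byte-order mark is counted
    );
}
```

The ASCII letter is half the size in UTF-8, the Chinese character is smaller in UTF-16, and the emoji takes four bytes either way.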

I wouldn't recommend using PHP for developing international software, though, because it doesn't really properly support Unicode. It has some add-on functions for working with Unicode encodings (look at the multibyte string functions), but the PHP core treats strings as bytes, not characters, so the standard PHP string functions are not suitable for working with characters that are encoded as more than one byte. For example, if you call PHP's strlen() on a string containing the UTF-8 representation of the character "大", it will return 3, because that character takes up three bytes in UTF-8, even though it's only one character. Using string-splitting functions like substr() is precarious, because if you split in the middle of a multi-byte character, you corrupt the string.
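The difference between the byte-oriented core functions and the multibyte functions can be sketched as follows. This assumes PHP 7 or later with the mbstring extension loaded; the two-character sample string and the variable names are chosen only for illustration.

```php
<?php
// Sketch only: assumes PHP 7+ and the mbstring extension.

// Two Chinese characters (U+5927, U+5B66), three bytes each in UTF-8.
$text = "\u{5927}\u{5B66}";

$byteCount = strlen($text);             // counts bytes: 6
$charCount = mb_strlen($text, 'UTF-8'); // counts characters: 2

// Cutting at a byte offset can stop in the middle of a character:
// this keeps the first character plus one stray byte of the second.
$brokenCut     = substr($text, 0, 4);
$brokenIsValid = mb_check_encoding($brokenCut, 'UTF-8'); // false

// Cutting at a character offset keeps every character whole.
$safeCut     = mb_substr($text, 0, 1, 'UTF-8');
$safeIsValid = mb_check_encoding($safeCut, 'UTF-8');     // true

printf("bytes: %d, characters: %d\n", $byteCount, $charCount);
printf("substr() result is valid UTF-8: %s\n", var_export($brokenIsValid, true));
printf("mb_substr() result is valid UTF-8: %s\n", var_export($safeIsValid, true));
```

Passing the encoding explicitly to the `mb_` functions avoids depending on whatever default encoding the server happens to be configured with.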

Most other languages used for Web development, such as Java, C#, and Python, have built-in support for Unicode, so you can put arbitrary Unicode characters into a string and not need to worry about which encoding is used to represent them in memory: from your point of view, a string contains characters, not bytes. This is a much safer, less error-prone way to work with Unicode text. For this and other reasons (PHP isn't really that great a language), I'd recommend using something else.

(I've read that PHP 6 will have proper Unicode support, but that's not available yet.)

Wyzard
+1 for the explanation about UTF-*, -1 for discouraging the use of PHP entirely for i18n apps. As long as you're aware that you need to use the `mb_` functions for string handling when it matters, PHP is perfectly adequate for i18n apps. This should not be a criterion for or against it.
deceze
Adequate, yes, but not the best choice IMO.
Wyzard
Actually, I see some point in treating 大 as 3 letters, because if you put 大 into a DB, the DB will not treat 大 as 1 ASCII character...
Blender
@Ole Jak: If you're storing multibyte characters in a database it's important to know what encoding is used in the database so that you can determine the byte length. Note that the encoding used within the database isn't necessarily the same as the encoding you use in your application code.
Wyzard
A: 

It is better to use UTF-8, because it covers the accented characters of languages all around the world. UTF-8 also has room to encode characters that are not yet assigned or recognized. I prefer and always use UTF-8 and its related encodings.

VAC-Prabhu
+1  A: 

Unicode is a standard which defines a bunch of abstract characters (so-called code points) and their properties (is it a digit, is it uppercase, etc.). It also defines certain encodings (methods to represent characters with bytes), UTF-8 being one of them. See "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" by Spolsky for more details.

I would certainly go with UTF-8. It is the standard everywhere these days and has some nice properties, such as leaving all 7-bit ASCII characters in place. That means most HTML-related functions, such as htmlspecialchars, can be used directly on the UTF-8 representation, so you have less chance of leaving encoding-related security holes. Also, a lot of PHP functions explicitly expect UTF-8 strings, and UTF-8 has better text editor support than alternatives like UTF-16.
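To make the ASCII-compatibility point concrete, here is a small sketch assuming PHP 7 or later. The input string is a made-up example. It checks that every byte of a multi-byte UTF-8 character has its high bit set (so none of them can be mistaken for `<`, `&`, or a quote), and then runs htmlspecialchars over a UTF-8 string.

```php
<?php
// Sketch only: assumes PHP 7+ (for the "\u{...}" escape).

// Every byte of a multi-byte UTF-8 character is 0x80 or above,
// so it can never collide with an ASCII character such as "<".
$multiByteChar  = "\u{5927}"; // encoded as the bytes E5 A4 A7
$allBytesHigh   = true;
foreach (str_split($multiByteChar) as $byte) {
    if (ord($byte) < 0x80) {
        $allBytesHigh = false;
    }
}

// Hypothetical user input mixing markup, a CJK character, and quotes.
$userInput = "<b>\u{5927}</b> & \"quotes\"";

// Only the ASCII special characters are rewritten; the multi-byte
// character passes through untouched.
$escaped = htmlspecialchars($userInput, ENT_QUOTES, 'UTF-8');

echo $escaped, PHP_EOL;
```

Naming the encoding in the third argument is a deliberate choice here, so the result does not depend on the server's default charset setting.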

Tgr