I am able to use UTF-8 characters just fine in my scripts. As a matter of fact, it is even possible for names of variables and functions to contain Unicode characters. There is also the mbstring extension, which deals with multi-byte strings. Yet in countless articles, PHP is criticized for its lack of Unicode support. I don't get it; why is PHP said to not support Unicode?

+9  A: 

When PHP was started several years ago, UTF-8 was not really supported. We are talking about a time when non-Unicode operating systems like Windows 98/Me were still current and when other big languages like Delphi were also non-Unicode. Not all languages were designed with Unicode in mind from day 1, and completely changing your language to Unicode without breaking a lot of stuff is hard. Delphi only became Unicode-compatible a year or two ago, for example, while other languages like Java or C# were designed with Unicode in mind from day 1.

So when PHP grew and became PHP 3, PHP 4 and now PHP 5, simply no one decided to add Unicode. Why? Presumably to stay compatible with existing scripts, or because utf8_decode()/utf8_encode() and mbstring already existed and worked. I do not know for sure, but I strongly believe that it has something to do with organic growth. Features do not simply exist by default; they have to be written by someone, and that simply has not happened for PHP yet.

PS: PHP 6 will finally have proper Unicode Support.

Edit: OK, I read the question wrong. The question is: how are strings stored internally? If I type in "Währung" or "Écriture", which encoding is used to create the bytes? In the case of PHP, a string is just a sequence of bytes: effectively ASCII plus whatever codepage you happened to use, with nothing in the string recording which one. That means: if I encode the string using ISO-8859-15 and you decode it with some Chinese codepage, you will get weird results. The alternative is found in languages like C# or Java, where everything is stored as Unicode, which means there is no codepage anymore and, theoretically, you cannot mess up. I recommend Joel's article about Unicode and Character Sets, but essentially it boils down to: how are strings stored internally? The answer with PHP is "not in Unicode", which means that you have to be very careful and explicit when processing strings to make sure the string always stays in the proper encoding during input, storage (database) and output, which is very error-prone.
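
To make the bytes-versus-characters point concrete, here is a minimal sketch. It assumes the mbstring extension is loaded, and the bytes are written out with escapes so that the result does not depend on how the script file itself is saved.

```php
<?php
// "Währung" spelled out as UTF-8 bytes: "ä" is the two bytes C3 A4.
$utf8 = "W\xC3\xA4hrung";

// The built-in strlen() counts bytes; mb_strlen() counts characters.
$byteCount = strlen($utf8);             // 8
$charCount = mb_strlen($utf8, 'UTF-8'); // 7

// The same word in ISO-8859-1: "ä" is the single byte E4.
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');
$latin1Hex = bin2hex($latin1);          // 57e46872756e67

// Nothing in the string records its encoding, so reading the ISO-8859-1
// bytes as UTF-8 fails: E4 followed by "h" is not a valid UTF-8 sequence.
$looksLikeUtf8 = mb_check_encoding($latin1, 'UTF-8'); // false

echo "$byteCount bytes, $charCount characters\n";
echo $latin1Hex, "\n";
var_dump($looksLikeUtf8);
```

Both variables are ordinary PHP strings; only the programmer knows which encoding each one is in.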

Michael Stum
That's not quite what I was asking. What factors make PHP Unicode-incompatible?
orlandu63
Don't want to beat a dead horse here, but "PHP 6 will have _____" has been a common refrain for years now. When is the darn thing going to be out? Will it even see widespread adoption now that old PHP code is so common?
TM
PHP 5 had the same problem; some people are still running PHP 4 for that reason (actually my own web host uses PHP 4 by default, I have to use a .htaccess to get PHP 5, and they even still offer PHP 3(!)). When PHP 6 finally comes out, it will surely take a loooong time before adoption is widespread.
Michael Stum
I like that you've linked to a Joel article that slates PHP for being backward and not supporting Unicode properly yet - and he wrote it in 2003!
MarkJ
+1  A: 

What is meant by 'support' is 'native support'. Take a look at this to get detailed information.

muratgu
That article is nearly 4 years old--hardly accurate information now.
postfuturist
+3  A: 

You say it yourself: in order to correctly deal with strings that contain multibyte characters, you need to use an extension. Forget even once to use the extension functions instead of the more familiar "normal" ones, and your data is mutilated. The same happens if you use a third-party library that hasn't been updated to use the extension functions everywhere.
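
A small sketch of what "mutilated" looks like in practice, assuming UTF-8 data and a loaded mbstring extension (the string is written with byte escapes so the file's own encoding does not matter):

```php
<?php
// "Écriture" as UTF-8 bytes: "É" is the two bytes C3 89.
$title = "\xC3\x89criture";

// The familiar function cuts after one byte and leaves half a character.
$broken = substr($title, 0, 1);             // "\xC3"

// The extension function cuts after one character.
$intact = mb_substr($title, 0, 1, 'UTF-8'); // "\xC3\x89", i.e. "É"

var_dump(mb_check_encoding($broken, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($intact, 'UTF-8')); // bool(true)
```

The call to substr() raises no warning; the half character only shows up later, when the string is displayed or stored.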

Also, a number of extremely popular encodings are still explicitly not supported by PHP, presumably because it's impossible to do so and stay backward-compatible.

Michael Borgwardt
+3  A: 

Many of the common extensions do not have Unicode support or (even worse) you "need to know" that a string contains Unicode/UTF-8 sequences, as with XMLReader, for example. And it can make quite a difference whether PHP's glob() calls FindFirstFileA or FindFirstFileW on win32.
Another issue (much smaller, but surprisingly often a source of annoyance) is BOMs, which PHP does not recognize.
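
To illustrate the BOM point: PHP treats the three bytes of a UTF-8 BOM as ordinary data, so it is up to the script to remove them. The sketch below uses a literal string as a stand-in for file contents, and strip_utf8_bom() is a hypothetical helper, not something PHP provides.

```php
<?php
// Stand-in for file_get_contents() on a UTF-8 file saved with a BOM.
$contents = "\xEF\xBB\xBF" . "id,name\n1,Ada\n";

// The BOM comes back as ordinary data, so "id" is not at offset 0.
$offset = strpos($contents, 'id'); // 3

// Hypothetical helper: drop a leading UTF-8 BOM if there is one.
function strip_utf8_bom($text)
{
    if (strncmp($text, "\xEF\xBB\xBF", 3) === 0) {
        return substr($text, 3);
    }
    return $text;
}

$clean = strip_utf8_bom($contents);

var_dump($offset);              // int(3)
var_dump(strpos($clean, 'id')); // int(0)
```

The same bytes at the top of a script file, before the opening tag, are sent to the browser as output, which is where the "headers already sent" surprises tend to come from.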

VolkerK
+1  A: 

Many of the string functions are just thin wrappers around C library equivalents, which also treat everything as a sequence of bytes. Another reason is that PHP carries around a lot of needless backward-compatibility baggage and thus gets stuck with bad design decisions from PHP 3 and 4.
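
strrev() is a handy way to see the byte-sequence behaviour. In the sketch below, utf8_strrev() is a hypothetical helper (not a PHP function) that assumes valid UTF-8 input and a PCRE build with UTF-8 support:

```php
<?php
$word = "W\xC3\xA4hrung"; // "Währung" as UTF-8 bytes

// strrev() reverses bytes, so the two bytes of "ä" come out swapped.
$byteReversed = strrev($word); // "gnurh\xA4\xC3W", not valid UTF-8

// Hypothetical helper: split into characters with the /u modifier,
// then reverse the list of characters instead of the bytes.
function utf8_strrev($text)
{
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    return implode('', array_reverse($chars));
}

$charReversed = utf8_strrev($word); // "gnurh\xC3\xA4W", i.e. "gnurhäW"

echo bin2hex($byteReversed), "\n"; // 676e757268a4c357
echo bin2hex($charReversed), "\n"; // 676e757268c3a457
```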

Maybe with 5.3's namespaces they'll finally have a way of phasing the old functions out.

Ant P.
+2  A: 

The concept of a "multibyte character" is at the core of the problem.

  1. It leaks an implementation detail: you should be able to work with the abstraction of a character without knowing how the implementers chose to represent the data. Maybe, depending on the platform, it suits them to represent everything as UTF-16 or UTF-32, in which case everything is multibyte; users of the character abstraction should not have to care.
  2. It's a kludge: On top of an out-of-date habit of thought where we all "really know" that strings are byte sequences, we now have to know that sometimes the bytes clump together into things known as Unicode characters, and have special cases all over the place to deal with it.
  3. It's like a mouse trying to eat an elephant. By framing Unicode as an extension of ASCII (we have normal strings and we have mb_strings) it gets things the wrong way around, and gets hung up on what special cases are required to deal with characters with funny squiggles that need more than one byte. If you treat Unicode as providing an abstract space for any character you need, ASCII is accommodated in that without any need to treat it as a special case.
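
One way to see the "abstract space" view from within PHP is to look at code points rather than bytes. This is only a sketch: code_points() is a hypothetical helper, and it assumes valid UTF-8 input and a loaded mbstring extension.

```php
<?php
// "a", "ä" and "€" as UTF-8: one, two and three bytes respectively.
$text = "a\xC3\xA4\xE2\x82\xAC";

// Hypothetical helper: return the Unicode code point of each character.
function code_points($utf8)
{
    // UTF-32BE uses exactly four bytes per character, which unpack('N*')
    // reads as a list of unsigned 32-bit big-endian integers.
    $utf32 = mb_convert_encoding($utf8, 'UTF-32BE', 'UTF-8');
    return array_values(unpack('N*', $utf32));
}

$points = code_points($text); // array(0x61, 0xE4, 0x20AC)

foreach ($points as $cp) {
    printf("U+%04X\n", $cp);  // U+0061, U+00E4, U+20AC
}
```

Seen this way, "a" is just code point U+0061 like any other character; the one-, two- or three-byte question only comes back once an encoding is chosen.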
d__