I have previously read Spolsky's article on character encoding, as well as this from Dive Into Python 3. I know PHP is getting Unicode support at some point, but I am having trouble understanding why this is such a big deal.

If PHP-CLI is being used, OK, it makes sense. However, in the web server world, isn't it up to the browser to take this integer and turn it into a character (based on the character encoding)?

What am I not getting?

A: 

Well, for one thing you need to somehow generate the strings the browser displays :-)

n3rd
Yeah, a string is an immutable array of bytes, practically meaningless without some sort of encoding scheme. Care to elaborate?
Precisely. And if the string manipulation functions don't know how to handle the encoding scheme, how are they supposed to work correctly?
n3rd
ahhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh
A: 

There's an awesome FAQ section on Unicode and the Web here. See if it answers some of your questions.

Ólafur Waage
+1  A: 

The PHP string functions often treat strings as sequences of 8-bit (single-byte) characters. I've had all sorts of issues with Chinese text going through the string functions. substr(), for example, can cut a multi-byte character in half, which causes all manner of problems for XML parsers.
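To make the substr() problem concrete, here is a rough sketch in Python rather than PHP, using Python's bytes type as a stand-in for PHP's byte-oriented strings. The helper name byte_substr is made up for illustration and only mimics the byte-slicing behaviour described above; it is not PHP's actual implementation.

```python
# Sketch: slicing UTF-8 text by bytes vs. by characters.
# byte_substr is a hypothetical helper that imitates a byte-oriented substr().

text = "漢字テスト"                # 5 characters
raw = text.encode("utf-8")        # each of these characters takes 3 bytes


def byte_substr(data: bytes, start: int, length: int) -> bytes:
    """Return `length` bytes starting at `start`, ignoring character boundaries."""
    return data[start:start + length]


# Taking 4 bytes keeps the first character plus one stray byte of the second.
cut = byte_substr(raw, 0, 4)

try:
    cut.decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError:
    # The slice ends in the middle of a multi-byte sequence.
    is_valid_utf8 = False

# A character-aware slice (roughly what mb_substr is for) stays well-formed.
char_cut = text[:2]

print(len(text), len(raw), is_valid_utf8, char_cut)
```

The byte slice is no longer valid UTF-8, which is the kind of input a strict XML parser will reject, while the character-aware slice stays intact.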

James Socol
+2  A: 

PHP does "support" UTF-8; look at the mbstring extension. Most of the problem comes from PHP developers who don't use the mb_* functions when dealing with UTF-8 data.

UTF-8 characters are often more than one byte long, so you need to use functions that account for that fact, like mb_strpos rather than strpos.
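As a rough illustration of why the two kinds of functions disagree, here is a small Python sketch (not PHP). The helpers byte_pos and char_pos are hypothetical names standing in for a byte-oriented search like strpos and a character-aware search like mb_strpos; they only model the idea, not PHP's real behaviour in every edge case.

```python
# Sketch: byte offset vs. character offset of the same substring in UTF-8 text.
# byte_pos and char_pos are hypothetical helpers, named here for illustration.


def byte_pos(haystack: str, needle: str) -> int:
    """Search in the UTF-8 bytes, as a byte-oriented function would."""
    return haystack.encode("utf-8").find(needle.encode("utf-8"))


def char_pos(haystack: str, needle: str) -> int:
    """Search in characters (code points), as a multibyte-aware function would."""
    return haystack.find(needle)


text = "naïve café"

# "ï" occupies two bytes in UTF-8, so everything after it is shifted by one
# in the byte-based result.
print(byte_pos(text, "café"))  # 7
print(char_pos(text, "café"))  # 6
```

Mixing the two kinds of offsets (for example, feeding a byte offset into a character-based function) is where the subtle bugs tend to come from.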

It works fine if you are just getting UTF-8 from the browser -> putting it in the database -> getting it back out -> displaying it to the user. If you are doing something more involved with UTF-8 data (or indeed any major text processing), you should probably consider using an alternative language.

Salgo