I have a site I want to migrate from ISO to UTF-8.

I have a record in the database indexed by the following primary key:

s:22:"Informations générales";

The problem is that now, with UTF-8, serializing the same string gives:

s:24:"Informations générales";

(Notice that the length prefix is now the number of bytes, not the number of characters.)

So this is not compatible with the previous non-UTF-8 records!

Did I do something wrong? How can I fix this?

Thanks

+2  A: 

PHP 4 and 5 do not have built-in Unicode support. PHP 6 was meant to add it, but that version was never released, and later versions still treat strings as byte sequences.

Amber
I know that; I just wanted to know about the "primary key" situation.
Matthieu
A: 

You did nothing wrong. PHP's native strings are not Unicode-aware (the Unicode support planned for version 6 never shipped), so you only get character-level handling if you add it yourself, e.g. via the mbstring extension or other means.

We wrote our own wrapper around serialize() to remedy this. You could also move to another serialization format such as JSON (json_encode() and json_decode() are available in PHP since 5.2.0).
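
As a rough illustration of the JSON route (not the wrapper mentioned above, which is not shown here): JSON carries no length prefix, so the byte-count mismatch cannot occur. The sketch assumes the data is already valid UTF-8, because json_encode() rejects other encodings. The variable names are only for illustration.

```php
<?php
// "Informations générales" as UTF-8 bytes, written with escapes so the
// example does not depend on the encoding of this source file.
$utf8 = "Informations g\xC3\xA9n\xC3\xA9rales";

// By default json_encode() escapes non-ASCII characters as \uXXXX,
// so the stored value is plain ASCII with no length prefix.
$json = json_encode($utf8);
echo $json, "\n";

// Decoding gives back the original UTF-8 byte string.
$roundTrip = json_decode($json);
var_dump($roundTrip === $utf8);
```

Note that this only helps for new records: existing rows serialized with serialize() would still have to be converted once.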

Boldewyn
A: 

The behaviour is completely correct. Two strings with different encodings will generate different byte streams, thus different serialization strings.
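
A minimal sketch of that difference, assuming the old encoding was ISO-8859-1 (the question only says "ISO"). The bytes are written as escapes so the result does not depend on how this snippet itself is saved.

```php
<?php
// Same text, two encodings.
$latin1 = "Informations g\xE9n\xE9rales";          // ISO-8859-1: each é is 1 byte
$utf8   = "Informations g\xC3\xA9n\xC3\xA9rales";  // UTF-8: each é is 2 bytes

// serialize() writes the byte length in the s:N: prefix.
$oldKey = serialize($latin1);
$newKey = serialize($utf8);

echo $oldKey, "\n"; // prefix is s:22:
echo $newKey, "\n"; // prefix is s:24:
```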

soulmerge
OK, so this is normal: serialization uses the byte length, not the character length.
Matthieu
@Matthieu: I know this sounds strange, but in PHP strings are actually byte arrays. You will get the same output if you `echo strlen($utf8EncodedString)`. For the *character* length, you need `mb_strlen()`.
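
For instance, with the string from the question (a small sketch that assumes the mbstring extension is loaded):

```php
<?php
// "Informations générales" as UTF-8 bytes.
$utf8EncodedString = "Informations g\xC3\xA9n\xC3\xA9rales";

$byteLength = strlen($utf8EncodedString);             // counts bytes
$charLength = mb_strlen($utf8EncodedString, 'UTF-8'); // counts characters

echo $byteLength, "\n"; // 24
echo $charLength, "\n"; // 22
```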
soulmerge
Another one: `file_get_contents()` will give you a string (even when getting contents of binary files). Socket functions, too.
soulmerge
Yep, I know all that; I just wanted to be sure whether serialize() uses the "string length" or the "memory size". Apparently it is the memory size, so I just have to re-generate all my database records that contain serialized data (previously encoded in ISO) with PHP in UTF-8.
Matthieu
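
The re-generation described in the last comment could look roughly like the sketch below. It rests on several assumptions: the old data is ISO-8859-1, the stored bytes have not already been transcoded (otherwise the old length prefixes no longer match and unserialize() fails), the values are only scalars and arrays, and mbstring is available. The function names `reserialize_as_utf8()` and `to_utf8()` are hypothetical helpers, not an existing API.

```php
<?php
// Recursively convert strings (and string array keys) from ISO-8859-1 to UTF-8.
function to_utf8($value)
{
    if (is_string($value)) {
        return mb_convert_encoding($value, 'UTF-8', 'ISO-8859-1');
    }
    if (is_array($value)) {
        $out = array();
        foreach ($value as $key => $item) {
            if (is_string($key)) {
                $key = mb_convert_encoding($key, 'UTF-8', 'ISO-8859-1');
            }
            $out[$key] = to_utf8($item);
        }
        return $out;
    }
    // Integers, floats, booleans and null carry no encoding.
    // Objects are returned unchanged; this sketch does not handle them.
    return $value;
}

// Unserialize with the old byte lengths, convert, then serialize again
// so the s:N: prefixes reflect the UTF-8 byte counts.
function reserialize_as_utf8($oldSerialized)
{
    return serialize(to_utf8(unserialize($oldSerialized)));
}

$old = "s:22:\"Informations g\xE9n\xE9rales\";";
$new = reserialize_as_utf8($old);
echo $new, "\n"; // prefix is now s:24:
```

Because the serialized string is used as a primary key here, every table that references the old key would need the same update in the same pass.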