I'm working on a project which needs to be Unicode-aware. PHP provides a bunch of useful functions like str_word_count() to calculate the number of words in some input, but they don't work on UTF-8 data in PHP < 6, which is a shame. The same applies to strlen(), strrev(), etc.

What should I do about this? PHP 6 is still not even out yet so I can't require people to have it to use my software...

Should I just write a wrapper library for string functions that will either use PHP 6's functions or my own in case the version is below 6?

A: 

As you suggest, you could create a wrapper library until PHP 6 becomes a standard install. (This is years away, bearing in mind the speed of ISP take-up for PHP 5.)

The mbstring functions may well prove to be sufficient in the meantime.
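
A minimal sketch of such a wrapper, assuming the input is valid UTF-8. The name utf8_strlen() is hypothetical and not a PHP built-in; it prefers the mbstring extension and falls back to a byte-level count when the extension is not installed:

```php
<?php
// Hypothetical wrapper (illustrative name, not part of PHP).
function utf8_strlen($string)
{
    // Prefer the mbstring extension when it is available.
    if (function_exists('mb_strlen')) {
        return mb_strlen($string, 'UTF-8');
    }

    // Fallback: count every byte that is NOT a UTF-8 continuation byte
    // (10xxxxxx, i.e. 0x80-0xBF). Each character has exactly one such byte.
    return preg_match_all('/[^\x80-\xBF]/', $string, $matches);
}
```

The rest of the application would call utf8_strlen() everywhere instead of strlen(), so the underlying implementation can be swapped later without touching the callers.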

middaparka
+3  A: 

For multibyte string functions, you should check the Multibyte String Functions section of the PHP manual.

For example, there is an mb_strlen function, the equivalent of strlen, but one which works with UTF-8.

Unfortunately, there are only a few of those functions, and not every str* function has an mb_str* equivalent... Still, it is definitely possible to create a website that's 100% UTF-8 in PHP 5.x.
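
As a short illustration of the difference (a sketch that assumes the mbstring extension is enabled; the sample string is made up):

```php
<?php
// "naïve" written with explicit UTF-8 bytes; the ï occupies two bytes.
$text = "na\xC3\xAFve";

$bytes = strlen($text);                      // counts bytes
$chars = mb_strlen($text, 'UTF-8');          // counts characters

$broken = substr($text, 0, 3);               // cuts the ï in half
$intact = mb_substr($text, 0, 3, 'UTF-8');   // keeps the whole ï
```

Here $bytes is 6 while $chars is 5, and only $intact is still valid UTF-8.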
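
For a function with no mb_* counterpart, such as strrev, one possible approach is to split on code points with PCRE's /u modifier. This is only a sketch: utf8_strrev() is a hypothetical name, and it reverses code points, so a combining mark will end up detached from its base letter.

```php
<?php
// Hypothetical helper (illustrative name, not part of PHP).
function utf8_strrev($string)
{
    // With /u the subject is treated as UTF-8, so the empty pattern
    // splits between code points rather than between bytes.
    $chars = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false when the input is not valid UTF-8.
    if ($chars === false) {
        return false;
    }

    return implode('', array_reverse($chars));
}
```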

Pascal MARTIN
+3  A: 

You could use the multibyte string comparison functions.

Another good idea might be looking at how others do it, especially well-established and mature systems like WordPress and Drupal. As far as I am aware, they all have their own wrappers around the multibyte functions.

Additional possibly interesting resources:

Pekka
+1  A: 

I've created a wrapper class for this in PHP 5 (IMO this is the only reliable way to go). Here are some of my implementations of the functions you mentioned:

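// Note: iki() is the author's own framework, not a PHP built-in.
function iki_String_str_word_count($string, $format = 0, $search = null)
{
    // A "word" is a run of letters, combining marks, dashes, apostrophes
    // (ASCII and U+2019), plus any extra characters passed in $search.
    $result = iki()->Regex->Match_All($string, '[\p{L}\p{Mn}\p{Pd}\'\x{2019}' . iki()->Regex->Quote($search) . ']+', 0, 'u');

    if ($format == 0)
    {
        return count($result);
    }

    return $result;
}

function iki_String_strlen($string)
{
    // utf8_decode() converts UTF-8 to ISO-8859-1, where every character
    // is one byte, so strlen() then returns the character count.
    return strlen(utf8_decode($string));
}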
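
Since the word-count function above depends on framework-specific methods, here is a rough standalone equivalent that uses only preg_match_all(). It is a sketch rather than a translation of the original: utf8_word_count() is a hypothetical name, the $search parameter is dropped, and only the count is returned.

```php
<?php
// Hypothetical helper (illustrative name, not part of PHP).
function utf8_word_count($string)
{
    // \p{L} letters, \p{Mn} combining marks, \p{Pd} dashes,
    // plus the ASCII apostrophe and the typographic one (U+2019).
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    $pattern = '/[\p{L}\p{Mn}\p{Pd}\'\x{2019}]+/u';

    // Returns the number of matches, or false on invalid UTF-8.
    return preg_match_all($pattern, $string, $matches);
}
```

Because \p{Pd} is part of the character class, a dash standing on its own between spaces is also counted as a word.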

There is also an open-source project on SourceForge called PHP UTF-8 that implements a lot of the str_* family of functions. The Kohana PHP Framework also claims to be 100% UTF-8 compatible.

Alix Axel
PS: Forgive my framework-specific methods, but I'm in a hurry and can't translate the code right now. Hopefully it'll be just enough to get the message across.
Alix Axel