Today I ran into a problem with the PHP function strpos(): it returned FALSE even though the correct result was obviously 0. The cause was that one parameter was encoded in UTF-8, but the other (which came from an HTTP GET parameter) was not.

Now I have noticed that using the mb_strpos function solved my problem.

My question now is: is it wise to use the PHP multibyte string functions in general to avoid these problems in the future? Should I avoid the traditional strpos, strlen, ereg, etc. functions altogether?

Note: I don't want to set mbstring.func_overload globally in php.ini, because this leads to other problems when using the PEAR library. I am using PHP 4.

+1  A: 

There have been some problems with the mb_* functions in PHP versions prior to 5.2, so if your code runs on multiple platforms with different versions of PHP, strange behavior can occur. Furthermore, the mb_strpos function is rather slow: it has to skip the number of characters specified by the offset parameter to find the real byte position used internally. In loops that depend on strpos/mb_strpos, this can become a major bottleneck.
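
To illustrate the difference between byte positions and character positions mentioned above, here is a minimal sketch. It assumes the mbstring extension is loaded and the string is UTF-8; the sample string is made up for illustration.

```php
<?php
// "\xC3\xA4" is the UTF-8 encoding of "ä": two bytes, one character.
$haystack = "\xC3\xA4bc";

// strpos() counts bytes, so "c" is found at byte offset 3.
$bytePos = strpos($haystack, 'c');

// mb_strpos() counts characters, so "c" is found at character offset 2.
$charPos = mb_strpos($haystack, 'c', 0, 'UTF-8');

var_dump($bytePos); // int(3)
var_dump($charPos); // int(2)
```

The two results are not interchangeable: an offset from one function should not be passed to the other family of functions.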

Björn
+1, good to know.
altermativ
+1  A: 

You don't necessarily have to use mb_strpos, but you do need to make sure that all the data in your app is the same: either an mb_string, or a plain string in one particular encoding. (Usually UTF-8.)

If you make sure your pages are UTF-8, and your form submissions are interpreted as UTF-8, and your database stores UTF-8, you'll generally be OK. Indexed string operations (in particular truncations) can break a UTF-8 sequence, which is annoying but not generally disastrous. If you do need that level of support, mb_strings are your only option (but of course you have to make sure that all parts of your app and libraries and PHP version can cope with them properly).
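
As a rough sketch of the truncation issue described above (assuming the mbstring extension is available and the text is UTF-8; the sample word is only an illustration), a byte-based cut can land in the middle of a character while a character-based cut does not:

```php
<?php
// "naïve" in UTF-8; the "ï" is the two bytes C3 AF.
$text = "na\xC3\xAFve";

// substr() cuts after 3 bytes, leaving "na" plus half of the "ï".
$byteCut = substr($text, 0, 3);

// mb_substr() cuts after 3 characters, giving "naï".
$charCut = mb_substr($text, 0, 3, 'UTF-8');

var_dump(mb_check_encoding($byteCut, 'UTF-8')); // bool(false)
var_dump(mb_check_encoding($charCut, 'UTF-8')); // bool(true)
```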

Developing sites that handle Unicode correctly in PHP isn't too much fun right now: its Unicode support is very poor compared to languages like Python and .NET. It is hoped PHP6 will improve matters.

bobince
+1  A: 

If you use the same encoding everywhere it generally isn't a problem. I use UTF-8 for all my pages, and have never actually encountered this problem. In the end it really comes down to specifying the same encoding for the pages and the database.

For example:

header('Content-type: text/html;charset=utf-8');
mysql_query('SET NAMES utf8');

In most cases this means that all the data sources for the application will deliver data in the same encoding, and thus you'll avoid this kind of problem.

This will all be much better with the advent of PHP 6, btw, since it will include full Unicode support.

Emil H
PHP recommends using mysql_set_charset() instead of making a query with SET NAMES in it, for reliability. Also note that just setting the charset in the Content-type will not guarantee that browsers will only send UTF-8; some browsers ignore it (and users can override it). You need to filter too.
thomasrutter
I used "SET NAMES" in order to indicate that you can do this regardless of DB API. After all, the old mysql functions are rather outdated these days. Good point about filtering, though. :)
Emil H
+2  A: 

It depends on the character encoding you are using. With single-byte character encodings, or with UTF-8 (where a single byte inside a character can never be mistaken for another character), you can continue to use the regular string search functions, as long as the string you are searching in and the string you are searching for are in the same encoding.

If you are using a multi-byte encoding other than UTF-8, one that does not prevent single bytes within a character from looking like other characters, then it is never safe to do a string search using the regular string search functions: you may find false positives. This is because PHP's string comparison in functions such as strpos is per-byte. With the exception of UTF-8, which is specifically designed to prevent this problem, multi-byte encodings have the problem that any subsequent byte in a multi-byte character may match part of a different character.
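
A small sketch of such a false positive, using Shift_JIS as the example encoding (my choice for illustration, not something named above). It assumes the mbstring extension is loaded; the bytes are written as escapes so the result does not depend on the encoding of the source file.

```php
<?php
// In Shift_JIS, the katakana "ソ" is the two bytes 0x83 0x5C.
// The second byte, 0x5C, is also the ASCII backslash.
$sjis = "\x83\x5C";

// strpos() compares bytes, so it "finds" a backslash inside the character.
$bytePos = strpos($sjis, '\\');

// mb_strpos() is told the encoding and compares whole characters,
// so it reports that the string contains no backslash.
$charPos = mb_strpos($sjis, '\\', 0, 'SJIS');

var_dump($bytePos); // int(1), a false positive
var_dump($charPos); // bool(false)
```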

If the string you are searching in and the string you are searching for are in different character encodings, then conversion will always be necessary. Otherwise, for any string that is represented differently in the other encoding, the search will always return false. You should do such conversion on input: decide on a character encoding your app will use, and be consistent within the application. Any time you receive input in a different encoding, convert on the way in.
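
Converting on the way in might look roughly like the sketch below. The helper name to_utf8 is hypothetical, and treating anything that is not valid UTF-8 as ISO-8859-1 is only an assumption for the example; in a real application you need to know or decide what the fallback encoding is. It requires the mbstring extension.

```php
<?php
// Hypothetical helper: return $input as UTF-8.
// Assumption: input that is not valid UTF-8 is in $fallback (ISO-8859-1 here).
function to_utf8($input, $fallback = 'ISO-8859-1')
{
    if (mb_check_encoding($input, 'UTF-8')) {
        return $input; // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($input, 'UTF-8', $fallback);
}

$haystack = "\xC3\xBCber uns"; // "über uns" in UTF-8
$needle   = "\xFCber";         // "über" in ISO-8859-1, e.g. from a GET parameter

$rawResult       = strpos($haystack, $needle);          // encodings differ
$convertedResult = strpos($haystack, to_utf8($needle)); // same encoding

var_dump($rawResult);       // bool(false)
var_dump($convertedResult); // int(0)
```

This reproduces the situation from the question: the unconverted search returns false, while the converted one returns 0. Since 0 and false are both falsy, the result should be compared with === rather than ==.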

thomasrutter
A: 

I would recommend using the following PHP UTF-8 library:

http://sourceforge.net/projects/phputf8

Bundling it with your application loosens your application's requirements by not requiring the mbstring extension, but you still get UTF-8 string functions.

Jordan Ryan Moore