I'm trying to search a UTF8-encoded string using preg_match.

preg_match('/H/u', "\xC2\xA1Hola!", $a_matches, PREG_OFFSET_CAPTURE);
echo $a_matches[0][1];

This should print 1, since "H" is at index 1 in the string "¡Hola!". But it prints 2. So it seems like it's not treating the subject as a UTF8-encoded string, even though I'm passing the "u" modifier in the regular expression.

I have the following settings in my php.ini, and other UTF8 functions are working:

mbstring.func_overload = 7
mbstring.language = Neutral
mbstring.internal_encoding = UTF-8
mbstring.http_input = pass
mbstring.http_output = pass
mbstring.encoding_translation = Off

Any ideas?

+4  A: 

The u modifier is only to get the pattern interpreted as UTF-8, not the subject.

This is not a nice solution, but try mb_strlen to get the length in UTF-8 characters rather than bytes:

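$str = "\xC2\xA1Hola!";
preg_match('/H/u', $str, $a_matches, PREG_OFFSET_CAPTURE);
// $a_matches[0][1] is a byte offset; count the characters in the prefix that precedes it.
// Caution: with mbstring.func_overload = 7, substr() is itself character-based.
echo mb_strlen(substr($str, 0, $a_matches[0][1]), 'UTF-8');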
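
If you need this in more than one place, the same idea can be wrapped in a small helper. This is only a sketch: `byte_offset_to_char_offset` is a made-up name, not a PHP function, and it assumes the subject is valid UTF-8. It uses mb_strcut() instead of substr() because mb_strcut() always works on byte positions and is not affected by mbstring.func_overload.

```php
<?php
// Hypothetical helper (not part of PHP): turn a byte offset reported by
// preg_match() into a character offset. Assumes $subject is valid UTF-8.
function byte_offset_to_char_offset($subject, $byteOffset)
{
    // Take the prefix up to the byte offset, then count its characters.
    $prefix = mb_strcut($subject, 0, $byteOffset, 'UTF-8');
    return mb_strlen($prefix, 'UTF-8');
}

$str = "\xC2\xA1Hola!";
preg_match('/H/u', $str, $a_matches, PREG_OFFSET_CAPTURE);
echo byte_offset_to_char_offset($str, $a_matches[0][1]), "\n"; // prints 1
```

For the string in the question, preg_match reports byte offset 2 and the helper turns that into character index 1.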
Gumbo
Man, it's 2010 and PHP still sucks *abysmally* at Unicode.
Tomalak
"The u modifier is only to get the pattern interpreted as UTF-8, not the subject." This is not true. Compare e.g. `preg_split('//', .)` with `preg_split('//u', .)`. Since this "x is interpreted as UTF-8" is a bit vague, see [this](http://www.pcre.org/pcre.txt) for the actual effects of the unicode mode.
Artefacto
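
A quick way to see the difference described in the comment above is to split a short string both ways. This is just an illustrative sketch; the sample string and variable names are my own, and PREG_SPLIT_NO_EMPTY is only there to drop the empty pieces at the start and end.

```php
<?php
$str = "\xC2\xA1H"; // "¡H": 2 characters, 3 bytes in UTF-8

// Without the u modifier, the subject is split between every byte.
$bytes = preg_split('//', $str, -1, PREG_SPLIT_NO_EMPTY);

// With the u modifier, the subject is split between every UTF-8 character.
$chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);

echo count($bytes), "\n"; // prints 3
echo count($chars), "\n"; // prints 2
```

So the u modifier does change how the subject is read during matching; it is only the offsets returned by PREG_OFFSET_CAPTURE that stay in bytes.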
A: 

Looks like this is a "feature", see http://bugs.php.net/bug.php?id=37391

The 'u' switch only makes sense for PCRE; PHP itself is unaware of it. From PHP's point of view, strings are byte sequences, and returning a byte offset seems logical (I don't say "correct").

stereofrog
Great...and they don't provide a mb_preg_replace.
JW