I know that in normal PHP regex (ASCII mode) "\w" (word) means "letter, number, and _". But what does it mean when you use a multibyte regex with the "u" modifier?

preg_replace('/\W/u', '', $string);
+3  A: 

Anything that isn't a letter, number or underscore.

So, in terms of Unicode character classes, \W matches every character that is not in the L (letter) or N (number) character classes and is not the underscore character.

If you were to write it using the \p{xx} syntax, it would be equivalent to [^\p{L}\p{N}_].
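
As a rough illustration of that equivalence, here is a minimal sketch that strips non-word characters from a sample string twice: once with \W, and once with the property classes spelled out separately as \p{L} and \p{N}. The sample string and variable names are made up for the example. It assumes a PHP build whose "u" modifier also turns on Unicode properties for \w, and the exact definition of \w in that mode depends on the PCRE version PHP is built against.

```php
<?php
// Sample input (an assumption for this sketch): ASCII, accented Latin,
// Cyrillic, an underscore, digits, and some punctuation.
$string = "Héllo, мир_42!";

// Strip everything that is not a word character, in UTF-8 mode.
$viaShorthand = preg_replace('/\W/u', '', $string);

// The same idea spelled out with Unicode property classes.
$viaProperties = preg_replace('/[^\p{L}\p{N}_]/u', '', $string);

// Show both results and whether they agree on this input.
var_dump($viaShorthand, $viaProperties, $viaShorthand === $viaProperties);
```

If the two results differ on your installation, the \p{xx} form is the one whose behaviour is spelled out explicitly in the pattern.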

Welbog
Well, I'm glad to see others assume this too, but can we back up this statement in any way? I'm not sure where to dig in the PHP source or where to find someone who has verified this...
Xeoncross
@Xeoncross: It's what it's defined to be. Do you have any particular reason to doubt that it behaves the way it's defined to behave? If you're that worried about it, just use the `\p{xx}` syntax instead.
Welbog
OK, thanks. The only reason I doubted it was the poor, lack-of-thought UTF-8 support I've come to expect from PHP functions. I didn't want to assume it worked like this if `\W` was only designed for ASCII sequences. Thanks for the fast input.
Xeoncross
@Xeoncross: If you really want to test it out, write a regex using the `\w` syntax and an equivalent one using the `\p{xx}` syntax and see if there are any discrepancies in what they match. I wouldn't expect any, but you never know.
Welbog
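
The test suggested in the last comment could be sketched roughly as follows. It walks over a range of code points and records every one where `\w` and `[\p{L}\p{N}_]` disagree. The helper names `codepointToUtf8` and `findDiscrepancies` are hypothetical, chosen only for this example, and the scan is limited to the Basic Multilingual Plane to keep it quick. Whether any discrepancies turn up may depend on the PHP and PCRE versions in use, so treat the output as a property of your own installation.

```php
<?php
// Hypothetical helper: encode one Unicode code point as a UTF-8 string.
function codepointToUtf8(int $cp): string
{
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12))
            . chr(0x80 | (($cp >> 6) & 0x3F))
            . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18))
        . chr(0x80 | (($cp >> 12) & 0x3F))
        . chr(0x80 | (($cp >> 6) & 0x3F))
        . chr(0x80 | ($cp & 0x3F));
}

// Hypothetical helper: list the code points up to $max where the two
// patterns give different answers.
function findDiscrepancies(int $max): array
{
    $found = [];
    for ($cp = 0; $cp <= $max; $cp++) {
        // Surrogates are not valid in UTF-8, so skip them.
        if ($cp >= 0xD800 && $cp <= 0xDFFF) {
            continue;
        }
        $ch = codepointToUtf8($cp);
        $shorthand  = preg_match('/\A\w\z/u', $ch);
        $properties = preg_match('/\A[\p{L}\p{N}_]\z/u', $ch);
        if ($shorthand !== $properties) {
            $found[] = $cp;
        }
    }
    return $found;
}

// Scan the Basic Multilingual Plane and report what was found.
$discrepancies = findDiscrepancies(0xFFFF);
printf("%d discrepancies\n", count($discrepancies));
foreach (array_slice($discrepancies, 0, 10) as $cp) {
    printf("U+%04X\n", $cp);
}
```

Raising the limit to 0x10FFFF covers all of Unicode at the cost of a longer run.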