ansaurus

Question

PHP PREG Regex: What does "\W" mean when using the UTF-8 modifier?

Answer 1

+3 A:

Anything that isn't a letter, number or underscore.

So, in terms of Unicode character classes, \W matches every character that is not in the L (letter) or N (number) classes and is not the underscore character.

If you were to write it using the \p{xx} syntax, it would be equivalent to [^\p{L}\p{N}_].

Welbog 2010-01-07 20:26:14
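
A minimal sketch of the difference the modifier makes, not part of the original answer. It assumes PHP 7 or later (for the `"\u{...}"` escapes) and a PCRE build with Unicode property support; as far as I know, PHP also switches on PCRE's Unicode-properties mode for `/u` patterns when the library supports it. The helper name `strip_non_word` is made up for illustration.

```php
<?php
// Hypothetical helper: remove everything \W matches, with or without /u.
function strip_non_word(string $text, bool $utf8): string
{
    $pattern = $utf8 ? '/\W/u' : '/\W/';
    return preg_replace($pattern, '', $text);
}

// "héllo wörld", written with escapes so the file encoding does not matter.
$sample = "h\u{00E9}llo w\u{00F6}rld";

// With /u the accented letters count as word characters,
// so only the space should be removed.
var_dump(strip_non_word($sample, true));

// Without /u the subject is processed byte by byte, so the bytes that make up
// the accented letters are not recognised as letters. The exact result can
// depend on the locale tables PCRE uses, so it is printed as hex here.
var_dump(bin2hex(strip_non_word($sample, false)));
```

Comparing the two dumps on your own build shows directly whether `\W` is treating multi-byte letters as word characters.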

Well, I'm glad to see others assume this too, but can we back up this statement in any way? I'm not sure where to dig in the PHP source or where to find someone who has verified this...

Xeoncross 2010-01-07 20:30:38

@Xeoncross: It's what it's defined to be. Do you have any particular reason to doubt that it behaves the way it's defined to behave? If you're that worried about it, just use the `\p{xx}` syntax instead.

Welbog 2010-01-07 20:31:39

OK, thanks. The only reason I was doubting it was the poor, ill-considered UTF-8 support I've come to expect from PHP functions. I didn't want to assume it worked like this if `\W` was only designed for ASCII sequences. Thanks for the fast input.

Xeoncross 2010-01-07 20:35:47

@Xeoncross: If you really want to test it out, write a regex using the `\w` syntax and an equivalent one using the `\p{xx}` syntax and see if there are any discrepancies in what they match. I wouldn't expect any, but you never know.

Welbog 2010-01-07 20:38:03
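
The check suggested in the last comment could be sketched roughly like this. It is an illustration only: the function name `find_discrepancies` and the sample characters are assumptions, and it needs PHP 7 or later for the `"\u{...}"` escapes. It makes no claim about what your PCRE version will report, since the definition of `\w` under Unicode properties has varied between PCRE releases.

```php
<?php
// Hypothetical helper: return the characters for which \w (with /u) and the
// explicit class [\p{L}\p{N}_] disagree.
function find_discrepancies(array $chars): array
{
    $differing = [];
    foreach ($chars as $char) {
        $shorthand = preg_match('/^\w$/u', $char);
        $explicit  = preg_match('/^[\p{L}\p{N}_]$/u', $char);
        if ($shorthand !== $explicit) {
            $differing[] = $char;
        }
    }
    return $differing;
}

// A small, arbitrary sample: ASCII, accented Latin, CJK, a non-ASCII digit,
// a combining mark and a connector punctuation character.
$sample = [
    'a', '9', '_', ' ', '-',
    "\u{00E9}", // Latin small letter e with acute
    "\u{4E2D}", // CJK ideograph
    "\u{0663}", // Arabic-Indic digit three
    "\u{0301}", // combining acute accent
    "\u{203F}", // undertie (connector punctuation)
];

foreach (find_discrepancies($sample) as $char) {
    // Print the UTF-8 bytes of each character the two patterns disagree on.
    echo bin2hex($char), PHP_EOL;
}
```

If nothing is printed, the two forms agree on this sample; any line of output is a character worth checking against the documentation for the PCRE version your PHP is linked with.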
