I have a Perl regex /\W/i which matches all non-alphanumeric characters, but it also matches spaces which I want to ignore. How do I get it to match non-alphanumeric characters except spaces?

+10  A: 

You could use

/[^\w\s]/

This matches any single character that is neither a word character (\w) nor whitespace (\s).

EDIT:

/[^\w ]/

Use this form if you only want to ignore the space character (not all whitespace).
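
To see how the two patterns differ, here is a minimal sketch. The sample string is made up for illustration; note that the underscore is never reported, because \w includes it.

```perl
use strict;
use warnings;

# Hypothetical sample input: letters, digits, underscore, space, tab, punctuation.
my $text = "foo_bar 42\tbaz!?";

# Characters that are neither word characters nor any kind of whitespace.
my @no_ws = $text =~ /[^\w\s]/g;

# Characters that are neither word characters nor a literal space,
# so the tab is now reported as well.
my @no_space = $text =~ /[^\w ]/g;

print join(",", map { sprintf("U+%04X", ord($_)) } @no_ws), "\n";     # U+0021,U+003F
print join(",", map { sprintf("U+%04X", ord($_)) } @no_space), "\n";  # U+0009,U+0021,U+003F
```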

UPDATE:

Removed the /i modifier, as it's not needed (see the comments).

steinar
Note that this matches non-WORD characters, but Joe said he wanted to match non-ALPHANUMERIC characters. `\w` includes (and `\W` excludes) at least one non-alphanumeric, `_`. You would want to use `/[^a-z0-9\s]/i` to merely exclude alphanumerics. Assuming you don't care about accented characters, etc, which would open up a whole other can of worms.
Porculus
You don't need the `/i` modifier - `\w` already is case-insensitive.
Tim Pietzcker
A: 

I agree with steinar, but don't forget to chomp as well.

Joel
+5  A: 

For most purposes, [^\w\s] should suffice. That matches just one character which is neither an "alphanumunder" nor a PerlSpace.

That's almost but not quite like saying it matches anything that is neither \p{Alphabetic} nor \p{Digit} nor the underscore (LOW LINE) nor \p{WhiteSpace}, except for the weaseling regarding chr 11, vertical tab, since that is not considered \s, although it is considered \p{WhiteSpace}.

The little \s shorthand really means \p{PerlSpace}, not \p{WhiteSpace}. And \p{Space} is the same as \p{WhiteSpace}. The only \S character (meaning, not \s) which is also \p{Space} is that pesky vertical tab. Note that vertical tab is included in \v, so that means [\v\h], for any vertical or horizontal white space, is the same as \p{Space}, not \s. (Starting with Perl 5.18, \s matches vertical tab as well, which removes this discrepancy.)
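
A small sketch for checking the vertical tab on your own Perl. The \s result is deliberately not annotated, since it depends on the version: Perl 5.18 and later include vertical tab in \s, while older releases do not.

```perl
use strict;
use warnings;

my $vt = "\x0B";    # U+000B, vertical tab

# \p{Space} and [\v\h] both include the vertical tab.
printf "\\p{Space}: %s\n", ($vt =~ /\p{Space}/ ? "match" : "no match");
printf "[\\v\\h]:    %s\n", ($vt =~ /[\v\h]/   ? "match" : "no match");

# Whether \s matches depends on the Perl version running this script.
printf "\\s on Perl %vd: %s\n", $^V, ($vt =~ /\s/ ? "match" : "no match");
```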

I'm now going to get more precise regarding alphanumerics. For simplicity, I'm going to talk about positive matches. It should be easy to invert the logic to get negative matches.

If by "alphanumeric", you mean either letters or numbers, you should probably use properties that mean precisely that. \pL is short for \p{Letter}, which probably covers those. All letters are alphabetic, but there are characters that are \p{Alphabetic} yet not \p{Letter}, like Roman numerals, the circled letters, and various diacritics.

For numbers, the question is whether you mean to include digits only, or if other numbers are ok. \pN is short for \p{Number}, but that includes a lot of non-digits. \d is short for \p{Nd}, and that in turn is short for \p{Decimal_Number}, although \p{Digit} works fine, too. Numbers that aren't digits include Roman numerals, vulgar fractions, superscripted numbers, and circled digits.
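
The difference between \d and \pN can be checked with a few sample code points. This is only an illustrative sketch; the particular characters were picked as examples and are not from the original answer.

```perl
use strict;
use warnings;

# Example code points, given as escapes so the source file stays ASCII.
my @samples = (
    [ "DIGIT SEVEN",              "7"        ],
    [ "ARABIC-INDIC DIGIT FIVE",  "\x{0665}" ],
    [ "ROMAN NUMERAL FOUR",       "\x{2163}" ],
    [ "VULGAR FRACTION ONE FIFTH", "\x{2155}" ],
    [ "SUPERSCRIPT FOUR",         "\x{2074}" ],
    [ "CIRCLED DIGIT ONE",        "\x{2460}" ],
);

for my $sample (@samples) {
    my ($name, $ch) = @$sample;
    # Only the first two are decimal digits; all six are \p{Number}.
    printf "%-28s \\d:%-4s \\pN:%s\n",
        $name,
        ($ch =~ /\d/   ? "yes" : "no"),
        ($ch =~ /\pN/  ? "yes" : "no");
}
```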

Beginning sometime after Perl 5.11, you can use properties like \p{POSIX_Digit} for nothing but [0-9], \p{POSIX_Alpha} for only the letters, and \p{POSIX_Alnum} for both. There's also a \p{POSIX_Space} with that release or better, covering characters 9-13 plus 32 only, completely ignoring twenty other whitespace characters that come later.

Before then, you can still restrict your matches to the ASCII range by using a lookahead assertion that constrains the match to be ASCII only, using /(?=\p{ASCII})[\p{Alpha}\p{Digit}]/, although restricting characters to 7 bits is awfully last-millennium.
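
As a rough sketch of that lookahead in use (the input string and the variable names are hypothetical):

```perl
use strict;
use warnings;

# The lookahead requires the next character to be ASCII before the
# character class is tried against it.
my $ascii_alnum = qr/(?=\p{ASCII})[\p{Alpha}\p{Digit}]/;

# "a", GREEK SMALL LETTER ALPHA, "7", ARABIC-INDIC DIGIT SEVEN
my $input = "a\x{03B1}7\x{0667}";

my @hits = $input =~ /$ascii_alnum/g;
print scalar(@hits), " ASCII alphanumerics: @hits\n";    # 2 ASCII alphanumerics: a 7
```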

I'd probably rule out both Roman numerals and exotic diacritics, since neither is a \p{Letter} or a \p{Digit}, so would just use /[\p{Letter}\p{Digit}]/, which you can shorten to /[\pL\d]/ if you'd prefer.

Now you add white space to that with \s or the slightly broader \p{Space}, giving /[\p{Letter}\p{Digit}\p{Space}]/. I would leave it in that form, too, because I think it's clearer what you mean.

To negate that, you might think to prefix it with !, but that isn't quite the same since an empty string would match. So you should put a caret at the start of the character class to complement the set, making it /[^\p{Letter}\p{Digit}\p{Space}]/.

You could not just flip the sense of each \p into \P the way you could with a single property, since /[\P{Letter}\P{Digit}\P{Space}]/ matches any character that lacks at least one of the three properties: letters (which are nondigits), whitespace (also nondigits), digits (which are nonspaces), and so on. In other words, it matches everything.
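
Both pitfalls, the ! negation and the \P flip, can be demonstrated with a short sketch. The variable names and test characters here are illustrative only.

```perl
use strict;
use warnings;

my $positive   = qr/[\p{Letter}\p{Digit}\p{Space}]/;
my $complement = qr/[^\p{Letter}\p{Digit}\p{Space}]/;   # has none of the three properties
my $union_of_P = qr/[\P{Letter}\P{Digit}\P{Space}]/;    # lacks at least one of the three

# Negating the match succeeds on the empty string, although it contains
# no "other" character at all; the complemented class does not.
print "empty string passes the negated match\n" if "" !~ $positive;
print "empty string fails the complemented class\n" unless "" =~ $complement;

# Only "!" matches the complemented class; every character matches the \P union.
for my $ch ("a", "7", " ", "!") {
    printf "%-4s complement:%-4s union:%s\n",
        "'$ch'",
        ($ch =~ $complement ? "yes" : "no"),
        ($ch =~ $union_of_P ? "yes" : "no");
}
```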

There's still no reason to use /i, though.

tchrist
+2  A: 
[^\p{Alnum}\d ] # NOT alnum or space
Axeman