Is there an easy way to match all punctuation except period and underscore, in a C# regex? Hoping to do it without enumerating every single punctuation mark.

A: 

You could possibly use a negated character class like this:

[^0-9A-Za-z._\s]

This matches every character except those listed. You may need to exclude more characters (such as control characters), depending on your ultimate requirements.
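To see what "every character except those listed" means in practice, here is a minimal sketch. The sample input is only an assumption for illustration; it mixes ASCII punctuation with a non-ASCII letter and a currency sign.

```csharp
using System;
using System.Text.RegularExpressions;

class NegatedClassDemo
{
    static void Main()
    {
        // Anything that is NOT an ASCII letter, a digit, '.', '_' or white-space.
        var regex = new Regex(@"[^0-9A-Za-z._\s]");

        // Hypothetical sample: "a_b. c! é €" written with escapes to avoid encoding issues.
        string input = "a_b. c! \u00E9 \u20AC";

        foreach (Match m in regex.Matches(input))
        {
            // Print the code point rather than the character, so console encoding does not matter.
            Console.WriteLine("U+{0:X4} at index {1}", (int)m.Value[0], m.Index);
        }
    }
}
```

With this input the class matches '!' (U+0021), but also 'é' (U+00E9) and '€' (U+20AC), so it is broader than "punctuation" as soon as the text is not plain ASCII.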

Greg Hewgill
That would get spaces too
Abe Miessler
Okay, add space to the exclusion list.
Greg Hewgill
Alright, but I want half of your rep for this question...
Abe Miessler
Would work on a limited set, but a lot of printable characters (currency symbols, mathematical symbols, diacritics etc.) are going to match this.
Wrikken
How about `º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ` etc. (you get the idea)?
Lucero
A: 

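Here is something a little simpler: match anything that is not a word character, white-space or a period (where word characters include A-Za-z0-9 AND underscore; in .NET, \w also covers other Unicode letters and digits unless RegexOptions.ECMAScript is set).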

[^\w\s.]
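A small sketch of how this class behaves, assuming a made-up sample string. It runs the same pattern twice, once with default options and once with RegexOptions.ECMAScript, which restricts \w to [a-zA-Z0-9_].

```csharp
using System;
using System.Text.RegularExpressions;

class WordClassDemo
{
    static void Main()
    {
        string pattern = @"[^\w\s.]";

        // Hypothetical sample: "café costs 5€!" written with escapes.
        string input = "caf\u00E9 costs 5\u20AC!";

        // Default options: \w is Unicode-aware, so the accented letter counts as a word character.
        int unicodeAware = Regex.Matches(input, pattern).Count;

        // ECMAScript option: \w is ASCII-only, so the accented letter is matched as well.
        int asciiOnly = Regex.Matches(input, pattern, RegexOptions.ECMAScript).Count;

        Console.WriteLine("default: {0}, ECMAScript: {1}", unicodeAware, asciiOnly);
    }
}
```

With default options this should count two matches (the euro sign and the exclamation mark); with ECMAScript it should count three, because 'é' is no longer treated as a word character. Either way the euro sign is matched, so this class is still wider than punctuation alone.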
orvado
+6  A: 

The answers so far do not respect ALL punctuation. This should work:

(?![\._])\p{P}

(Explanation: Negative lookahead to ensure that neither . nor _ is matched, then match any Unicode punctuation character.)
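As a usage sketch, the pattern can be passed to Regex.Replace to strip all punctuation except the period and underscore. The input string here is just an assumed example.

```csharp
using System;
using System.Text.RegularExpressions;

class LookaheadDemo
{
    static void Main()
    {
        // Negative lookahead rejects '.' and '_'; \p{P} then matches any other punctuation character.
        string pattern = @"(?![._])\p{P}";

        // Hypothetical sample input.
        string input = "Hello, world! file_name.txt (v2)";

        // Replace every match with the empty string.
        string cleaned = Regex.Replace(input, pattern, "");

        Console.WriteLine(cleaned); // Hello world file_name.txt v2
    }
}
```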

Lucero
That didn't seem to match ^, ~ or `; could I be testing it wrong, or does .NET not consider them to be punctuation?
Smashery
@Smashery These are accents, you would never use those as punctuation in the English language.
steinar
Thanks very much! I decided to accept Les's answer, because I find Regex Subtraction easier to understand conceptually; thus I'm more likely to remember it; but +1 - thanks for teaching me some new things! (Wish I could accept two answers)
Smashery
+1  A: 

Use Regex Subtraction

[\p{P}-[._]]

Here's the link for .NET Regex documentation (I'm not sure if other flavors support it)... http://msdn.microsoft.com/en-us/library/ms994330.aspx

Here's a C# example

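using System;
using System.Text.RegularExpressions;

// \p{P} = punctuation, \p{S} = symbols; -[._] subtracts the period and underscore
string pattern = @"[\p{P}\p{S}-[._]]"; // added \p{S} to get ^,~ and ` (among others)
string test = @"_""'a:;%^&*~`bc!@#.,?";
MatchCollection mx = Regex.Matches(test, pattern);
foreach (Match m in mx)
{
    // print each matched character with its index and length
    Console.WriteLine("{0}: {1} {2}", m.Value, m.Index, m.Length);
}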

Explanation: The pattern is a character class subtraction. It starts with a standard character class like [\p{P}] and then adds a subtraction class like -[._], which says to remove the . and _ from it. The subtraction is placed inside the [ ], after the contents of the standard class.
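The subtraction form and the negative-lookahead form `(?![._])\p{P}` from the other answer should describe the same set of characters. A rough way to check that, sketched here as a brute-force loop over every UTF-16 code unit:

```csharp
using System;
using System.Text.RegularExpressions;

class EquivalenceCheck
{
    static void Main()
    {
        var lookahead = new Regex(@"(?![._])\p{P}");
        var subtraction = new Regex(@"[\p{P}-[._]]");

        int disagreements = 0;

        // Test each single char on its own and count the cases where the two patterns differ.
        for (int i = 0; i <= char.MaxValue; i++)
        {
            string s = ((char)i).ToString();
            if (lookahead.IsMatch(s) != subtraction.IsMatch(s))
            {
                disagreements++;
            }
        }

        Console.WriteLine("disagreements: {0}", disagreements);
    }
}
```

Both patterns accept a character exactly when it is in \p{P} and is neither '.' nor '_', so the count should come out as 0. Which one to use is mostly a matter of readability and portability, since character class subtraction is a .NET-specific syntax.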

Les
That didn't seem to match ^, ~ or `; could I be testing it wrong, or does .NET not consider them to be punctuation?
Smashery
If you drop the -[._], then \p{P} doesn't match them either.
Les
So .NET doesn't consider them to be punctuation?
Smashery
I am surprised that the grave accent is not considered punctuation. I suppose you need to define what you mean by punctuation. You can add the "symbol" character class (\p{S}) to pick up the accent, caret and tilde. I will edit my example.
Les
Thanks for teaching me a few new things!
Smashery
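On the question of why ^, ~ and ` are not matched by \p{P}: one way to check is to ask .NET which Unicode general category it assigns to each character. A minimal sketch using char.GetUnicodeCategory (the list of characters tested is just an example):

```csharp
using System;
using System.Globalization;

class CategoryDemo
{
    static void Main()
    {
        // Characters discussed in the thread, plus a few ordinary punctuation marks for comparison.
        foreach (char c in "^~`._!")
        {
            UnicodeCategory category = char.GetUnicodeCategory(c);
            Console.WriteLine("{0} -> {1}", c, category);
        }
    }
}
```

This reports ModifierSymbol for ^ and `, and MathSymbol for ~, which are symbol (\p{S}) categories rather than punctuation (\p{P}). The period and exclamation mark come back as OtherPunctuation and the underscore as ConnectorPunctuation, which is why \p{P} matches those.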