I'm looping through thousands of strings with various regexes to check for simple errors. I would like to add a regex to check for the correct use of commas.

If a comma exists in one of my strings, then it MUST be followed by either whitespace or exactly three digits:

  • valid: ,\s
  • valid: ,\d\d\d

But if a comma is followed by any other pattern, then it is an error:

  • invalid: ,\D
  • invalid: ,\d
  • invalid: ,\d\d
  • invalid: ,\d\d\d\d

The best regex I've come up with thus far is:

Regex CommaError = new Regex(@",(^(\d\d\d)|\S)"); // fails case #2

To test, I am using:

if (CommaError.IsMatch(", ")) // should NOT match
    Console.WriteLine("failed case #1");
if (CommaError.IsMatch(",234")) // should NOT match
    Console.WriteLine("failed case #2");
if (!CommaError.IsMatch("0,a")) // should match
    Console.WriteLine("failed case #3");
if (!CommaError.IsMatch("0,0")) // should match
    Console.WriteLine("failed case #4");
if (!CommaError.IsMatch("0,0a1")) // should match
    Console.WriteLine("failed case #5");

But the regex I gave above fails case #2 (it matches when it should not).

I've invested several hours investigating this, and searched the Web for similar regexes, but have hit a brick wall. What's wrong with my regex?

Update: Peter posted a comment with a regex that works the way I want:

Regex CommaError = new Regex(@",(?!\d\d\d|\s)");

Edit: Well, almost. It fails in this case:

if (!CommaError.IsMatch("1,2345")) // should match
    Console.WriteLine("failed case #6");
A: 

In which language are you trying to do this? This is a Perl-compatible regular expression to match such a case: ,(?!(\s|\d{3}[^\d])) (it matches commas not followed by whitespace or exactly three digits, so if a string matches this regexp, it is not valid)

krcko
This one matches ,233 which it should not match
Andomar
I'm using C#. With your regex, `Regex CommaError = new Regex(@",(?!(\s|\d{3}[^\d]))");`, test case #2 fails for some reason.
gw
It's failing because the `[^\d]` is saying there has to be a non-digit after the 3 digits. Since the 233 (or 234 in case #2) is at the end of the string, there is no non-digit after the 3 digits.
Laurence Gonsalves
Instead of `[^\d]` it should be another lookahead: `(?!\d)`. @Laurence, your regex should have that, too. Currently, it fails to flag a comma that's followed by four or more digits, e.g. `1,2345`.
Alan Moore
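
To make the point in the comments above concrete, here is a minimal C# sketch contrasting the two forms. The class name `LookaheadDemo` and the field names `Consuming` and `Asserting` are made up for illustration; only the two patterns come from the thread.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaheadDemo
{
    // Form from the answer: [^\d] must consume one non-digit character,
    // so it cannot succeed when the three digits end the string.
    public static readonly Regex Consuming =
        new Regex(@",(?!(\s|\d{3}[^\d]))");

    // Form suggested in the comments: (?!\d) consumes nothing and only
    // asserts that the next character, if there is one, is not a digit.
    public static readonly Regex Asserting =
        new Regex(@",(?!(\s|\d{3}(?!\d)))");

    public static void Main()
    {
        // A match means "this string contains a comma error".
        foreach (string s in new[] { ",234", ",234 ", "1,2345" })
        {
            Console.WriteLine(
                $"\"{s}\": consuming={Consuming.IsMatch(s)}, asserting={Asserting.IsMatch(s)}");
        }
    }
}
```

Only `,234` should differ between the two: the consuming form flags it as an error because nothing follows the digits, while the asserting form accepts it.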
+5  A: 

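In most regex syntaxes, ^ only means "not" inside a character class (e.g. [^a-b]). Anywhere else it is an anchor that matches the start of the string or line, which is why your (^(\d\d\d)|\S) does not negate anything.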

The simplest thing for you to do would be to invert the condition in your if statement.

If you can't do that for whatever reason you can use a negative lookahead in some regex syntaxes. eg:

,(?!\d\d\d(?!\d)|\s)
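
Since the question is about C#, here is a small sketch that runs the lookahead pattern above against the six test cases from the question. The names `CommaCheck` and `HasCommaError` are hypothetical helpers, not part of the original code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CommaCheck
{
    // Matches a comma that is NOT followed by whitespace
    // or by exactly three digits (three digits, then no fourth).
    public static readonly Regex CommaError =
        new Regex(@",(?!\d\d\d(?!\d)|\s)");

    public static bool HasCommaError(string s) => CommaError.IsMatch(s);

    public static void Main()
    {
        // Inputs and expectations taken from the question's test cases #1 to #6.
        var cases = new (string Input, bool ExpectError)[]
        {
            (", ", false),
            (",234", false),
            ("0,a", true),
            ("0,0", true),
            ("0,0a1", true),
            ("1,2345", true),
        };

        for (int i = 0; i < cases.Length; i++)
        {
            if (HasCommaError(cases[i].Input) != cases[i].ExpectError)
                Console.WriteLine($"failed case #{i + 1}");
        }
    }
}
```

With the pattern as written, none of the six cases should be reported as failed.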

In regex syntaxes that don't support negative assertions you can still do what you want, but the bigger the negative match the more complicated the regex gets. eg:

,($|[^ \d]|\d$|\d[^\d]|\d\d$|\d\d[^\d]|\d\d\d\d)

Essentially you have to enumerate all of the bad cases.
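
As a rough check of the enumerated pattern, the following C# sketch runs it over the question's inputs plus a few extra strings of my own (`"1,234 and 5,678"`, `"1,"`, and a tab case). The class name `EnumeratedCheck` is an assumption for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EnumeratedCheck
{
    // Every way a comma can be wrong, spelled out without lookahead:
    // end of string, a non-space non-digit, one or two digits followed by
    // end of string or a non-digit, or four digits in a row.
    public static readonly Regex CommaError =
        new Regex(@",($|[^ \d]|\d$|\d[^\d]|\d\d$|\d\d[^\d]|\d\d\d\d)");

    public static bool HasCommaError(string s) => CommaError.IsMatch(s);

    public static void Main()
    {
        string[] valid = { ", ", ",234", "1,234 and 5,678" };
        string[] invalid = { "0,a", "0,0", "0,0a1", "1,2345", "1," };

        foreach (string s in valid)
            if (HasCommaError(s)) Console.WriteLine($"wrongly flagged: \"{s}\"");

        foreach (string s in invalid)
            if (!HasCommaError(s)) Console.WriteLine($"missed: \"{s}\"");

        // Note: [^ \d] treats only a literal space as acceptable, so a comma
        // followed by a tab is flagged here; use [^\s\d] to allow any whitespace.
        Console.WriteLine($"tab after comma flagged: {HasCommaError(",\t")}");
    }
}
```

One difference from the lookahead version is worth knowing: because the character class contains a literal space rather than `\s`, this pattern accepts only a space after the comma, not other whitespace.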

Laurence Gonsalves
You don't need the non-capturing group when doing alternation inside a lookahead; `,(?!\d\d\d|\s)` will work.
Peter Boughton
Peter: good point. It was harmless, but unnecessary. I've removed it.
Laurence Gonsalves
Peter, thank you, your regex works the way I want it to. :)
gw
+1 Learning something every day on SO :) haha
Andomar
The lookahead version is not quite right. See my comment to @krcko's answer.
Alan Moore
Alan: thanks for pointing that out. I hadn't noticed the "exactly" bit of the question (or the `,\d\d\d\d` test case). I've updated both regexes.
Laurence Gonsalves