I'm quite new to regular expressions and I'm trying to create a regex for the validation of an invoice format.

The pattern should be: any of the characters J, j, Y, y (all 4 are legitimate), used 0, 2 or 4 times, e.g. no Y's at all is valid, YY is valid, YYYY is valid, but YYY should fail. This is followed by a run of 0's repeating 3 to 10 times. The whole string should never exceed 10 characters.

Examples:

JyjY000000 is valid (albeit quite strange)
YY000 is valid
000000 is valid
jjj000 is invalid
jjjj0 is invalid

I learned some basics from here, but my regex fails when it shouldn't. Can someone assist in improving it?

My regex so far is: [JjYy]{0}|[JjYy]{2}|[JjYy]{4}[0]{3,10}.

The following also failed: [JjYy]{0|2|4}[0]{3,10}
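
A likely reason the first attempt misbehaves, sketched in Python rather than in the asker's own environment: alternation binds more loosely than anything else in the pattern, so it reads as three separate alternatives (nothing, two letters, or four letters followed by zeros) instead of "a prefix of 0, 2 or 4 letters, then zeros". The helper name `whole_string_matches` is made up for this illustration, and the assumption is that the validator requires the whole string to match.

```python
import re

# The first attempt from the question, copied verbatim.
attempt = re.compile(r"[JjYy]{0}|[JjYy]{2}|[JjYy]{4}[0]{3,10}")

def whole_string_matches(text: str) -> bool:
    """Return True if the attempt matches the entire string."""
    return attempt.fullmatch(text) is not None

# `|` splits the whole pattern, so only the last alternative
# carries the zeros.
for sample in ["YY", "YYYY000", "YY000", "000000"]:
    print(sample, whole_string_matches(sample))
```

Under that assumption the bare prefix YY and the string YYYY000 are accepted, while YY000 and 000000 are rejected, which would fit the "fails when it shouldn't" symptom.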

+2  A: 

How about:

^([JjYy]{2}){0,2}0{3,10}$

To check the length is ten characters or less, use a string length function rather than a regular expression - don't hammer nails with a screwdriver, and so forth.

Test:

#!perl
use warnings;
use strict;

my $re = qr/^([JjYy]{2}){0,2}0{3,10}$/;

my %tests = qw/JyjY000000 valid
           YY000 valid
           000000 valid
           jjj000 invalid
           jjjj0 invalid/;

for my $k (keys %tests) {
    print "$k is ";
    if ($k =~ /$re/) {
        print "valid";
    } else {
        print "invalid";
    }
    print " and it should be $tests{$k}.\n";
}

Produces

jjjj0 is invalid and it should be invalid.
YY000 is valid and it should be valid.
JyjY000000 is valid and it should be valid.
jjj000 is invalid and it should be invalid.
000000 is valid and it should be valid.
Kinopiko
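
The advice in the answer above (regex for the format, a string length function for the limit) could be combined roughly as follows. This is a Python sketch, not the answerer's code; `is_valid_invoice` and `MAX_LENGTH` are names invented here, and `fullmatch` stands in for the `^...$` anchors.

```python
import re

# Pattern from the answer above; fullmatch() plays the role of ^...$.
INVOICE_RE = re.compile(r"([JjYy]{2}){0,2}0{3,10}")
MAX_LENGTH = 10  # overall limit stated in the question

def is_valid_invoice(text: str) -> bool:
    """Check the overall length first, then the format."""
    return len(text) <= MAX_LENGTH and INVOICE_RE.fullmatch(text) is not None

for sample in ["JyjY000000", "YY000", "jjj000", "YYYY0000000000"]:
    print(sample, is_valid_invoice(sample))
```

Keeping the length test outside the pattern leaves the regex short, at the cost of having two checks to maintain.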
You beat me to it - and if it weren't for those parentheses, both would be the same :)
Amarghosh
People often complain that I answer too quickly.
Kinopiko
Big +1 for including unit tests!
T.J. Crowder
YYYY0000000000 is matching as valid and it shouldn't be as it is over ten characters long.
Dave Webb
I'd suggest counting the length of the string for that.
Kinopiko
+3  A: 
([jJyY]{2}){0,2}0{3,10}

If the total length limit is inclusive of the jJyY part, you can check it with a negative lookahead that makes sure there are no more than 10 characters in the string to begin with: (?![jJyY0]{11,})

\b(?![jJyY0]{11,})([jJyY]{2}){0,2}0{3,10}\b
Amarghosh
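
To see how the lookahead version in the answer above behaves, here is a small Python sketch. The pattern is copied verbatim; the helper `find_invoice` and the sample strings are assumptions made for illustration. Because the pattern uses `\b` rather than `^` and `$`, it is tried with `search`, so it can pick an invoice number out of surrounding text.

```python
import re
from typing import Optional

# Pattern from the answer above, copied verbatim.
LOOKAHEAD_RE = re.compile(r"\b(?![jJyY0]{11,})([jJyY]{2}){0,2}0{3,10}\b")

def find_invoice(text: str) -> Optional[str]:
    """Return the first invoice-like token in text, or None."""
    m = LOOKAHEAD_RE.search(text)
    return m.group(0) if m else None

for sample in ["YY000", "ref JyjY000000 paid", "YYYY0000000000", "00YY000"]:
    print(repr(sample), find_invoice(sample))
```

The lookahead only rejects a start position that is followed by 11 or more of the allowed characters; the order of letters and zeros is still enforced by the rest of the pattern, so a string such as 00YY000 is not accepted.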
YYYY0000000000 is valid and it shouldn't be as it is over ten characters long.
Dave Webb
If the total length limit is inclusive of the `jJyY` part, you can check it with a negative look ahead to make sure there are no more than 10 characters in the string to begin with `(?![jJyY0]{11,})`
Amarghosh
The jJyY0 part made it possible to use 0's before the JjYy's, which should be invalid. However, it was very close to what I was looking for, thanks.
Webleeuw
Can you post an example?
Amarghosh
I like the negative look ahead but you could make it simpler. With the lookahead you're only really interested in the length as you're checking the format with the rest of the regexp, so you could make your pattern: `(?!.{11,})([jJyY]{2}){0,2}0{3,10}`
Dave Webb
That's a good point, but wouldn't `\w` be better than `.`? Because `.` will eat the whole string and fail - thereby losing a possible match in a subsequent word.
Amarghosh
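
The difference between `.` and `\w` in the length guard can be sketched in Python as below. Both variants keep the `\b` anchors from the answer (an assumption of this sketch; the comment above quotes the pattern without them), and the sample sentence is invented.

```python
import re

PREFIX_AND_ZEROS = r"([jJyY]{2}){0,2}0{3,10}\b"
# Length guard using "." (any character except a newline).
dot_guard = re.compile(r"\b(?!.{11,})" + PREFIX_AND_ZEROS)
# Length guard using "\w" (word characters only).
word_guard = re.compile(r"\b(?!\w{11,})" + PREFIX_AND_ZEROS)

text = "YY000 is followed by more words"
print(dot_guard.search(text))   # the guard looks at the rest of the line
print(word_guard.search(text))  # the guard stops at the first space
```

With `.` the guard counts everything up to the end of the line, so a valid number embedded in a longer line is lost; with `\w` it only counts the current word. If the string holds nothing but the invoice number, the two behave the same.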
You may be right. I was assuming the only thing in the string would be the invoice number.
Dave Webb
@Amarghosh's comment about the [jJyY0]: I stand corrected, you are right. It might have been a side effect of my test page in ASP.NET, but when I reimplemented your example in a clean page it worked. That leaves me with the dilemma of who should get the accepted answer, because there are multiple roads that lead to Rome :$...
Webleeuw
I wasn't "fighting" to get accepted - just wanted to make sure my regex was indeed correct :)
Amarghosh
OK, thanks for being a good sport :)
Webleeuw
+1  A: 

As you need the total length to never exceed 10 characters I think you have to handle the three kinds of prefixes separately:

0{3,10}|[JjYy]{2}0{3,8}|[JjYy]{4}0{3,6}
Dave Webb
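
The three-case pattern in the answer above can be tried out with a short Python sketch. The function name `matches_three_cases` is invented here, and `fullmatch` is used on the assumption that the whole string must be an invoice number, since the pattern as posted has no anchors.

```python
import re

# One alternative per prefix length, so each can cap its own zeros:
# 0 letters + up to 10 zeros, 2 letters + up to 8, 4 letters + up to 6.
THREE_CASES = re.compile(r"0{3,10}|[JjYy]{2}0{3,8}|[JjYy]{4}0{3,6}")

def matches_three_cases(text: str) -> bool:
    """True if the whole string fits one of the three cases."""
    return THREE_CASES.fullmatch(text) is not None

for sample in ["JyjY000000", "YY000", "000000", "jjj000", "YYYY0000000000"]:
    print(sample, matches_three_cases(sample))
```

Each alternative limits letters plus zeros to 10 characters, so no separate length check is needed. In an engine without a whole-string match function, the alternation would need to be wrapped as `^(...)$` so the anchors apply to all three cases.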
Even better, don't use a regex to do that. Regexes are not the best way to count the number of characters in a string.
Kinopiko
I would say that Regexes are ideal for verifying that strings match specified formats, and the length is usually a part of any valid format. That you have to use an "or" (or a negative look ahead as shown in one of the other answers) means that the format is more complex than you would like, but it doesn't mean you should complicate your code with a separate check. I would strongly recommend keeping all the validation in one place - the regex - as this is better from a maintenance point of view.
Dave Webb
How about making sure that the string is not longer than 10 chars to begin with, using a negative lookahead? I've updated my answer with such a solution.
Amarghosh
I've accepted your answer because yours was the first to match all conditions I was looking for :).
Webleeuw
+1  A: 

It may depend on what you are using to implement the regular expression. For example, I found out the other day that Notepad++ only supports a few basic operators. Operators such as the pipe (alternation) are not available in every regex flavour; POSIX basic regular expressions, for example, do not define it.

I'd suggest something like this:

([JjYy]{2}([JjYy]{2})?)?[0]{3,10}

If you're using a programming language, you'll need to use a string length function to validate the length.

EDIT: actually, you should be able to validate the length by separating the different situations:

([0]{3,10})|([JjYy]{2}[0]{3,8})|([JjYy]{4}[0]{3,6})
DisgruntledGoat
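
The nested-optional pattern in the answer above accepts the same strings as the counted-group form `([JjYy]{2}){0,2}0{3,10}` posted in other answers. One way to convince yourself is a brute-force comparison over a small alphabet, sketched here in Python; `agree_up_to` is a made-up helper, and the alphabet "Jy0" is just a sample of the allowed characters.

```python
import re
from itertools import product

nested = re.compile(r"([JjYy]{2}([JjYy]{2})?)?[0]{3,10}")
counted = re.compile(r"([JjYy]{2}){0,2}0{3,10}")

def agree_up_to(max_len: int) -> bool:
    """Compare both patterns on every string over a small alphabet."""
    for n in range(max_len + 1):
        for chars in product("Jy0", repeat=n):
            s = "".join(chars)
            if bool(nested.fullmatch(s)) != bool(counted.fullmatch(s)):
                return False
    return True

print(agree_up_to(8))
```

This is a sanity check over short strings, not a proof, but it is a cheap way to compare two candidate patterns before choosing between them.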
+1  A: 
PP
That expression was near the mark, except that it failed on input consisting only of 0's.
Webleeuw
Probably because you don't want the slashes (`/`) in a C# regular expression. In Perl, Awk, and sed it is common to put slashes around a regular expression.
PP
A: 

I'm afraid the RegEx would be lengthy and complicated. You should validate the first part via PHP code instead.

Salman A