I'm trying to construct a regex to screen valid part and/or serial numbers in combination, with ranges.

A valid part number is a two alpha, three digit pattern or /[A-z]{2}\d{3}/

e.g. aa123 or ZZ443.

A valid serial number is a five digit pattern, or /\d{5}/

13245 or 31234 and so on.

That part isn't the problem. I want combinations and ranges to be valid as well:

The ultimate goal is to accept input like 12345, ab123,ab234-ab245, 12346 - 12349: ranges and/or series of part and/or serial numbers in any combination. Note that spaces are optional around the dash in a range and after a comma in a series. Note also that a range of part numbers has the same two-letter combination on both sides of the range (e.g. ab123 - ab239).

I have been wrestling with this expression for two days now, and haven't come up with anything better than this:

/^(?:[A-z]{2}\d{3}[, ]*)|(?:\d{5}[, ]*)|(?:([A-z]{2})\d{3} ?- ?\4\d{3}[, ]*)|(?:\d{5} ?- ?\d{5}[, ]*)$/

...

My Regex-Fu is weak.

+1  A: 

You might not want to do this all with regexes. If you just have a comma-separated list of part/serial numbers, which optionally are ranges, this might be easier:

split input on commas
for each input:
   if there is a dash:
       split on a dash, strip each element to remove whitespace
       make sure each side is a part or a serial number (can use 2 regexes here)
       if they're part numbers, make sure they start w/ the same two letters
   else:
       strip to remove whitespace, make sure is a valid part or serial number

If everything passes, then the input is correct.
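The steps above could be sketched in JavaScript roughly as follows. This is only a minimal sketch: the names PART, SERIAL, isValidItem, isValidRange and isValidList are made up for illustration, the patterns are the ones from the question with [A-z] swapped for [A-Za-z], and the two-letter prefixes are assumed to be compared case-sensitively.

```javascript
// Hypothetical helpers; patterns taken from the question (with [A-Za-z]).
const PART = /^([A-Za-z]{2})\d{3}$/;   // two letters, three digits
const SERIAL = /^\d{5}$/;              // five digits

function isValidItem(item) {
  return PART.test(item) || SERIAL.test(item);
}

function isValidRange(left, right) {
  // A serial range needs a serial number on both sides.
  if (SERIAL.test(left) && SERIAL.test(right)) return true;
  const l = PART.exec(left);
  const r = PART.exec(right);
  // A part range needs part numbers with the same two-letter prefix
  // (assumption: compared case-sensitively).
  return l !== null && r !== null && l[1] === r[1];
}

function isValidList(input) {
  // Split on commas, then check each entry as a single item or a range.
  return input.split(',').every((entry) => {
    const sides = entry.split('-').map((s) => s.trim());
    if (sides.length === 1) return isValidItem(sides[0]);
    if (sides.length === 2) return isValidRange(sides[0], sides[1]);
    return false; // more than one dash in a single entry
  });
}

console.log(isValidList('12345, ab123,ab234-ab245, 12346 - 12349'));
console.log(isValidList('ab123-cd456'));
```

Note that this does not check that the left side of a range is lower than the right side; the question does not ask for that.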

Claudiu
hm... definitely an option.
ScottSEA
JavaScript can do string split, so it shouldn't be too bad. Additionally, the code will be more readable, and if your requirements change for some reason, it should be easier to modify the code than the regex.
Claudiu
+1  A: 

First, [A-z] is wrong. In addition to letters, it will match a square bracket, backslash, caret, underscore or backtick--all the characters that lie between the uppercase letters and lowercase letters in the ASCII character set. You should use either [A-Za-z], or [A-Z] with the case-insensitive option.
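A quick way to see the difference, as a small sketch (the variable names are made up for illustration): test the six ASCII characters that sit between Z and a against both character classes.

```javascript
// The six ASCII characters between 'Z' (90) and 'a' (97).
const extras = ['[', '\\', ']', '^', '_', '`'];

const loose = /^[A-z]$/;      // the range used in the question
const strict = /^[A-Za-z]$/;  // letters only

const looseMatches = extras.filter((ch) => loose.test(ch));
const strictMatches = extras.filter((ch) => strict.test(ch));

console.log(looseMatches.length);  // 6: every one of them slips through
console.log(strictMatches.length); // 0
```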

To match either a single serial number or a range of serial numbers, do this:

/\d{5}(?:\s*-\s*\d{5})?/

...and for the part numbers:

/([A-Z]{2})\d{3}(?:\s*-\s*\1\d{3})?/i

In your regex you used \4, but that was wrong. The group that matched the letters in the first part number may have been the fourth group overall, but it was only the first capturing group, and non-capturing (?:...) groups are not numbered, so you should have used \1.
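To illustrate the group numbering, here is a small sketch using the part-number pattern, with ^ and $ added only so the demonstration checks the whole string (the name partRange is made up):

```javascript
// One capturing group (the letters); the range part is non-capturing.
const partRange = /^([A-Z]{2})\d{3}(?:\s*-\s*\1\d{3})?$/i;

const m = partRange.exec('ab123 - ab239');
console.log(m.length); // 2: the whole match plus the single capturing group
console.log(m[1]);     // "ab"

// Different letters on the right-hand side: \1 does not match.
console.log(partRange.test('ab123 - cd239')); // false
```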

Putting that together to match a whole series, you have

/(?:\b(?:\d{5}(?:\s*-\s*\d{5})?|([A-Z]{2})\d{3}(?:\s*-\s*\1\d{3})?)(?:,\s*)?)+/i

The comma has to be optional, but that means the regex could incorrectly match a sequence like 1234512345 or 12345ab123. Unlikely as that is to happen, I added the word boundary (\b) to cover it. There has to be at least one non-word character between two serial/part numbers/ranges, and (?:,\s*)? means that can only be a comma and optional whitespace. Your [, ]* would allow any number of spaces and/or commas, or nothing at all.
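As a rough check of the combined pattern, the sketch below wraps it in ^ and $ so that the whole input has to match. The anchors and the name seriesPattern are additions for this example, not part of the pattern above.

```javascript
// The combined pattern, anchored to the start and end of the input.
const seriesPattern =
  /^(?:\b(?:\d{5}(?:\s*-\s*\d{5})?|([A-Z]{2})\d{3}(?:\s*-\s*\1\d{3})?)(?:,\s*)?)+$/i;

const samples = [
  '12345, ab123,ab234-ab245, 12346 - 12349', // the example from the question
  '1234512345',                              // two serials run together
  '12345ab123',                              // serial and part run together
  'ab123-cd456',                             // range with different letters
];

for (const s of samples) {
  console.log(JSON.stringify(s), seriesPattern.test(s));
}
```

Only the first sample passes; the other three are rejected by the word boundary or by the backreference.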

Alan Moore
Fantastic! Initially I hadn't made the groups non-capturing, and forgot to change the backreference to \1 afterward. I really like the way you're thinking here - matching a part/serial with an optional range, rather than the part/serial/part range/serial range path I was heading down. +2 to Awesomeness.
ScottSEA
I did have to add beginning and end of string characters to the regex to get it to work properly, but awesome still. Thanks again.
ScottSEA