Consider a string that looks like this:

RR1 S5 C92

This is a rural route address for out-of-town mail delivery: Rural Route, Site, Compartment. Each letter (or pair of letters) is followed by a number, and the parts are separated by a space. The numbers are usually one to three digits long, but you never know how many digits there could be! If the user is lazy, they may have entered zero, one or many spaces between the parts.

Question: What regex would YOU use to determine if a given string matches this pattern?

Its usage would be something like this:

string ruralPattern; //a regex pattern here
bool isRural = Regex.IsMatch(someString, ruralPattern);

Update: Thank you for your suggestions! On performance and usage: the check will live in a static method in an assembly that is called from a web service. The strings being checked against this pattern will be at most 50 characters long. The method will be called roughly once every 5 seconds. Any suggestions on keeping it static? Much appreciated!

+6  A: 

This should work:

^[Rr][Rr]\d+ *[Ss]\d+ *[Cc]\d+$

or, as suggested in another comment, with [0-9] in place of \d:

^[Rr][Rr][0-9]+ *[Ss][0-9]+ *[Cc][0-9]+$

What it all means:

  • ^ - start of string
  • [Rr] - next char must be a R or r
  • [Rr] - next char must be a R or r
  • \d+ or [0-9]+ - next part must be 1 or more digits
  • (space)* - allow for 0 or more spaces
  • [Ss] - next char must be a S or s
  • \d+ or [0-9]+ - next part must be 1 or more digits
  • (space)* - allow for 0 or more spaces
  • [Cc] - next char must be a C or c
  • \d+ or [0-9]+ - next part must be 1 or more digits
  • $ - end of string

There might be a more elegant solution but this is pretty easy to read.
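
As a minimal sketch of how the pattern above could be wired into the check the question asks for (the class name RuralRouteCheck and method name IsRural are made up for illustration, and the sketch assumes Regex.IsMatch is the intended call, since it returns a bool):

```csharp
using System.Text.RegularExpressions;

static class RuralRouteCheck
{
    // Pattern from this answer: RR<digits>, S<digits>, C<digits>,
    // with zero or more spaces between the three parts.
    const string RuralPattern = @"^[Rr][Rr][0-9]+ *[Ss][0-9]+ *[Cc][0-9]+$";

    // Hypothetical helper: true when the whole string is a rural route address.
    public static bool IsRural(string input)
    {
        return input != null && Regex.IsMatch(input, RuralPattern);
    }
}
```

With this sketch, RuralRouteCheck.IsRural("RR1 S5 C92") and RuralRouteCheck.IsRural("rr1s5c92") both return true, while "RR S5 C92" (no route number) returns false.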

Edit: Updated to include input from some of the comments

Kelsey
Simplicity is a good thing with regexes.
Ron Warholic
Definitely... I wish more people would break down their solutions as I have above to make them easier to understand, since regex is not the most readable syntax.
Kelsey
+3  A: 

How about...

someString = someString.Trim(); // eliminate leading/trailing whitespace
bool isRural = Regex.IsMatch(
   someString,
   @"^rr\d+\s*s\d+\s*c\d+$",
   RegexOptions.IgnoreCase);

This eliminates the uppercase/lowercase switching within the pattern and uses \s to allow any whitespace character (e.g. tabs; note that \s also matches newlines). If you want spaces only, then '\s' should be changed to ' '.

bobbymcr
+1, this is the simplest and most correct answer yet, **but** be aware that `\d` matches more than just `[0-9]`. It matches any character for which char.IsDigit returns true, which by my count includes some **230** Unicode code points.
P Daddy
Yeah, that's true, and a similar claim can be made for `\s` (`char.IsWhiteSpace`).
bobbymcr
@P - thanks for the insight on the `\d`!
p.campbell
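
The comments above note that `\d` in .NET matches more than `[0-9]`. A small sketch of that difference, using the Arabic-Indic digit three (U+0663) as the test character; the class and method names are hypothetical, and RegexOptions.ECMAScript is shown as one way to restrict `\d` to ASCII digits:

```csharp
using System.Text.RegularExpressions;

static class DigitCheck
{
    // Arabic-Indic digit three: a Unicode decimal digit that is not in [0-9].
    public const string ArabicIndicThree = "\u0663";

    // Default .NET behaviour: \d matches any Unicode decimal digit.
    public static bool MatchesBackslashD(string s)
    {
        return Regex.IsMatch(s, @"^\d$");
    }

    // An explicit character class only matches ASCII 0-9.
    public static bool MatchesAsciiClass(string s)
    {
        return Regex.IsMatch(s, @"^[0-9]$");
    }

    // With RegexOptions.ECMAScript, \d is restricted to [0-9].
    public static bool MatchesEcmaScriptD(string s)
    {
        return Regex.IsMatch(s, @"^\d$", RegexOptions.ECMAScript);
    }
}
```

Here DigitCheck.MatchesBackslashD(DigitCheck.ArabicIndicThree) is true, while the other two methods return false for the same input. If such digits must be rejected, `[0-9]` is the most explicit choice.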
+1  A: 

Let's state the following assumptions:

  1. There are three sections to the string.
  2. Section 1 always starts with RR, uppercase or lowercase, and ends with one or more decimal digits.
  3. Section 2 always starts with S, uppercase or lowercase, and ends with one or more decimal digits.
  4. Section 3 always starts with C, uppercase or lowercase, and ends with one or more decimal digits.

For simplicity, the following would suffice.

[Rr][Rr][0-9]+[ ]+[Ss][0-9]+[ ]+[Cc][0-9]+
  1. [Rr] means exactly one letter R, upper or lower case.
  2. [0-9] means exactly one decimal digit.
  3. [0-9]+ means one or more decimal digits.
  4. [ ]+ means one or more spaces (use [ ]* instead if zero spaces must also be accepted).

However, a regex is usually more useful when it also captures the individual sections, so that the matcher can hand each section's value to its own variable.

Therefore, the following regex is more helpful.

([Rr][Rr][0-9]+)[ ]+([Ss][0-9]+)[ ]+([Cc][0-9]+)

Let's apply that regex to the string

string inputstr = "Holy Cow RR12 S53 C21";

This is what your regex matcher would let you know:

start pos=9, end pos=21
Group(0) = RR12 S53 C21
Group(1) = RR12
Group(2) = S53
Group(3) = C21

There are three pairs of parentheses. Each pair encloses a section of the string, which the regex engine calls a group.

The regex engine numbers the matches as follows:

  1. the whole matched string as group 0
  2. rural route as group 1
  3. site as group 2 and
  4. compartment as group 3.

Naturally, groups 1, 2 & 3 will encounter matches, if and only if group 0 has a match.

Therefore, your code could exploit that along the following lines (C#, using System.Text.RegularExpressions):

string postalstr, rroute, site, compart;
Match match = Regex.Match(
  inputstr,
  @"([Rr][Rr][0-9]+)[ ]+([Ss][0-9]+)[ ]+([Cc][0-9]+)");
if (match.Success)
{
  // match.Index and match.Length give the position of the whole match.
  postalstr = match.Groups[0].Value; // whole match: "RR12 S53 C21"
  rroute = match.Groups[1].Value;    // rural route: "RR12"
  site = match.Groups[2].Value;      // site: "S53"
  compart = match.Groups[3].Value;   // compartment: "C21"
}

Further, you may want to insert the values into a database table with the columns rr, site and compart, storing only the numerals without the letters "rr", "s" or "c". This would be the regex with nested grouping to use:

([Rr][Rr]([0-9]+))[ ]+([Ss]([0-9]+))[ ]+([Cc]([0-9]+))

And the matcher will let you know the following when a match occurs for group 0:

start=9, end=21
Group(0) = RR12 S53 C21
Group(1) = RR12
Group(2) = 12
Group(3) = S53
Group(4) = 53
Group(5) = C21
Group(6) = 21
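
As a hedged alternative to counting nested group numbers, .NET also supports named groups. The sketch below captures only the numerals; the group names (rr, site, compart) and the TryParse helper are hypothetical, chosen to mirror the database columns mentioned above:

```csharp
using System.Text.RegularExpressions;

static class RuralRouteParser
{
    // Same shape as the nested-group regex, but only the digits are
    // captured, and each capture has a name instead of a number.
    static readonly Regex Pattern = new Regex(
        @"[Rr][Rr](?<rr>[0-9]+)[ ]+[Ss](?<site>[0-9]+)[ ]+[Cc](?<compart>[0-9]+)");

    // Hypothetical helper: returns false and leaves the outputs null
    // when no rural route address is found in the input.
    public static bool TryParse(string input, out string rr, out string site, out string compart)
    {
        rr = null;
        site = null;
        compart = null;

        Match m = Pattern.Match(input);
        if (!m.Success)
        {
            return false;
        }

        rr = m.Groups["rr"].Value;
        site = m.Groups["site"].Value;
        compart = m.Groups["compart"].Value;
        return true;
    }
}
```

For the input "Holy Cow RR12 S53 C21", TryParse returns true with rr = "12", site = "53" and compart = "21".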
Blessed Geek
A: 

FYI: If you're going to be using this RegEx to test a lot of data, your best bet would be to tell .NET to precompile it - it will be compiled into IL and grant a performance boost, rather than simply interpreting the RegEx pattern each time. Specify it as a static member on whichever class contains your method, like so:

private static Regex re = new Regex("pattern", RegexOptions.Compiled | RegexOptions.IgnoreCase);

...and the method to test whether a string matches the pattern is...

bool matchesString = re.IsMatch("string");
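
Putting the two pieces together for the static-method scenario described in the question, a sketch might look like the following. The class and method names are hypothetical, the pattern is borrowed from the IgnoreCase answer in this thread, and whether RegexOptions.Compiled actually helps at this call rate is something only profiling can confirm:

```csharp
using System.Text.RegularExpressions;

static class AddressValidator
{
    // Built once when the class is first used. Regex instances are
    // immutable, so sharing one across threads for matching is safe.
    private static readonly Regex RuralRegex = new Regex(
        @"^rr\d+\s*s\d+\s*c\d+$",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Hypothetical static entry point for the web service to call.
    public static bool IsRuralRoute(string input)
    {
        if (input == null)
        {
            return false;
        }

        // Trim first so leading/trailing whitespace does not defeat ^ and $.
        return RuralRegex.IsMatch(input.Trim());
    }
}
```

A call such as AddressValidator.IsRuralRoute(" rr1 s5 c92 ") returns true, while AddressValidator.IsRuralRoute("RR1 S5") returns false.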

Good luck.

Tullo
*Maybe*. `RegexOptions.Compiled` isn't always a win, and profiling is necessary. See: http://www.codinghorror.com/blog/archives/000228.html and http://stackoverflow.com/questions/414328/using-static-regex-ismatch-vs-creating-an-instance-of-regex/414411#414411
P Daddy
Thanks Tullo and PDaddy. An update in the question around the expected usage!
p.campbell