Let's say I have two strings like this:

XABY
XBAY

A simple regex that matches both would go like this:

X(AB|BA)Y

However, I have a case where A and B are complicated strings, and I'm looking for a way to avoid having to specify each of them twice (on each side of the |). Is there a way to do this (that presumably is simpler than having to specify them twice)?

Thanks

A: 

If there are several strings between X and Y, containing any kind of characters, you'd be better off with:

X(.)+Y

Only numbers then

X([0-9])+Y

Only letters

X([a-zA-Z])+Y

Letters and numbers

X([a-zA-Z0-9])+Y
Ben
No. For example, the 'A' part in my example is actually (s=\s*(?<D>\d*\.?\d*)\s+ and the 'B' part is actually r=\s*(?<E>\d)(/(?<F>\d))?) . That's why writing AB|BA gets kind of hard to maintain.
Jimmy
+2  A: 
X(?:A()|B()){2}\1\2Y

Basically, you use an empty capturing group to check off each item when it's matched, then the back-references ensure that everything's been checked off.

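Be aware that this relies on undocumented regex behavior, so there's no guarantee that it will work in your regex flavor--and if it does, there's no guarantee that it will continue to work as that flavor evolves. As far as I know, it works in most flavors that support back-references. JavaScript is a known exception, because there a back-reference to a group that has not participated in the match succeeds and matches the empty string.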
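
To try the check-box trick outside a regex editor, here is a minimal sketch. It uses Python's `re` module rather than .NET, which is my assumption and not something from the question; the literal `A` and `B` stand in for the real sub-patterns, and `matches_both_orders` is a hypothetical helper name. Python fails a back-reference to a group that never participated in the match, which is the behavior the trick depends on.

```python
import re

# Each empty group () is a "check box": it is set only when its branch matches.
# \1 and \2 then fail unless both boxes were ticked during the two iterations.
CHECKBOX = re.compile(r"X(?:A()|B()){2}\1\2Y")


def matches_both_orders(text: str) -> bool:
    """Return True if text is X, then A and B in either order, then Y."""
    return CHECKBOX.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ["XABY", "XBAY", "XAAY", "XBBY"]:
        print(sample, matches_both_orders(sample))
```

The `{2}` forces exactly two iterations, and the back-references rule out the case where the same branch was taken twice.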

EDIT: You say you're using named groups to capture parts of the match, which adds a lot of visual clutter to the regex, if not real complexity. Well, if you happen to be using .NET regexes, you can still use simple numbered groups for the "check boxes". Here's a simplistic example that finds and picks apart a bunch of month-day strings without knowing their internal order:

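  using System;
  using System.Text.RegularExpressions;

  // The empty group after MONTH is \1, the one after DAY is \2.
  Regex r = new Regex(
    @"(?:
        (?<MONTH>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)()
        |
        (?<DAY>\d+)()
      ){2}
      \1\2",
    RegexOptions.IgnorePatternWhitespace);

  string input = @"30Jan Feb12 Mar23 4Apr May09 11Jun";
  foreach (Match m in r.Matches(input))
  {
    // Prints the month and the day, whichever order they appeared in.
    Console.WriteLine("{0} {1}", m.Groups["MONTH"], m.Groups["DAY"]);
  }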

This works because in .NET, the presence of named groups has no effect on the ordering of the non-named groups. Named groups have numbers assigned to them, but those numbers start after the last of the non-named groups. (I know that seems gratuitously complicated, but there are good reasons for doing it that way.)

Normally you want to avoid using named and non-named capturing groups together, especially if you're using back-references, but I think this case could be a legitimate exception.

Alan Moore
It works! Because my A and B expressions contain a whole bunch of groups of their own, I used named empty capturing groups: X(?:A(?<dummy1>)|B(?<dummy2>)){2}\k<dummy2>\k<dummy1>Y
Jimmy
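
For readers who want to experiment with the named check-box variant from the comment above, here is a rough sketch in Python's `re` module (an assumption; the thread itself is about .NET). The sub-patterns `A` and `B` are simplified, hypothetical stand-ins for the real ones, and `parse` is an illustrative helper. Python writes named groups as `(?P<name>...)` and named back-references as `(?P=name)` instead of `\k<name>`.

```python
import re

# Hypothetical, simplified stand-ins for the real A and B sub-patterns.
A = r"s=(?P<D>\d+)"
B = r"r=(?P<E>\d)"

# Named empty groups act as the check boxes, so their position among the
# other capturing groups does not matter.
PATTERN = re.compile(
    rf"X(?:{A}(?P<dummy1>)|{B}(?P<dummy2>)){{2}}(?P=dummy1)(?P=dummy2)Y"
)


def parse(text: str):
    """Return (D, E) if A and B each appear once, in either order, else None."""
    m = PATTERN.fullmatch(text)
    return (m.group("D"), m.group("E")) if m else None
```

Because the check boxes are referenced by name, adding or removing groups inside `A` or `B` does not require renumbering anything.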
@Jimmy: see my edit about the named groups.
Alan Moore
Thanks for the additional suggestion. I am using .net, but my A and B strings contain a bunch of un-named capturing groups too (when I create the regex I use RegexOptions.ExplicitCapture). I think using ?: in all these groups creates more clutter than using two named 'dummy' groups. Additional comments welcome :)
Jimmy
A: 

You can store regex pieces in variables, and do:

A=/* relevant regex pattern */
B=/* other regex pattern */
regex = X($A$B|$B$A)Y

This way you only have to specify each regex once, on its own line, which should make it easier to maintain.
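
As a concrete illustration of the variable approach, here is a small sketch in Python (my choice of language; the pseudocode above does not name one). The values of `A` and `B` are hypothetical placeholders, and `is_match` is an illustrative helper.

```python
import re

# Hypothetical stand-ins for the two complicated sub-patterns.
A = r"s=\d+"
B = r"r=\d"

# Each piece is written once; the alternation is assembled from the variables.
# Wrapping each piece in (?:...) keeps any | inside it from leaking out.
PATTERN = re.compile(rf"X(?:(?:{A})(?:{B})|(?:{B})(?:{A}))Y")


def is_match(text: str) -> bool:
    """Return True if text is X, then A and B in either order, then Y."""
    return PATTERN.fullmatch(text) is not None
```

One caveat in Python specifically: if the pieces contain named groups, the names end up duplicated in the assembled pattern and `re.compile` rejects it, so this sketch only suits pieces without named groups.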

Sidenote: You're trying to match permutations, which is fine while you're only looking at 2 subregexes. But if you wanted to add a third (or fourth), the number of alternatives grows factorially: three pieces need six orderings (abc|acb|bac|bca|cab|cba), and four need 24. If you need to go down the road of permutations, there's some good discussion on that here on Stack Overflow. It's for letter permutations, and the solutions use awk/bash/perl, but that at least gives you a starting point.
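
If you do go the permutation route, the alternation can at least be generated instead of typed out. The following is a sketch in Python under my own assumptions: `any_order` is a hypothetical helper, and the single-letter pieces are placeholders for real sub-patterns.

```python
import re
from itertools import permutations


def any_order(pieces):
    """Build an alternation containing every ordering of the sub-patterns."""
    orderings = (
        "".join(f"(?:{p})" for p in order) for order in permutations(pieces)
    )
    return "(?:" + "|".join(orderings) + ")"


# Hypothetical sub-patterns; three pieces already produce six alternatives.
PATTERN = re.compile("X" + any_order(["a", "b", "c"]) + "Y")
```

The generated pattern still grows with the factorial of the number of pieces, so this only hides the repetition from the source code; it does not make the regex itself any smaller.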

Tim
Sorry, this is not too useful. I use a regex editor to test out the regex, and that's the part that gets unwieldy when repeating complex regex... storing the regex in parts in my program wouldn't help in this case.
Jimmy
A: 

try this

X((A|B){2})Y
lzyy
Sorry, that wouldn't work -- I don't want to allow strings like AA or BB -- only AB or BA :)
Jimmy
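
The objection in the last comment is easy to check. This is a quick sketch in Python's `re` module (an assumption, since the thread is about .NET) with literal `A` and `B` as placeholders; `accepted_by` is a hypothetical helper.

```python
import re

LOOSE = re.compile(r"X((A|B){2})Y")   # the pattern from this answer
STRICT = re.compile(r"X(AB|BA)Y")     # the pattern from the question

SAMPLES = ["XABY", "XBAY", "XAAY", "XBBY"]


def accepted_by(pattern, samples):
    """Return the samples that the pattern matches in full."""
    return [s for s in samples if pattern.fullmatch(s)]


if __name__ == "__main__":
    print("loose: ", accepted_by(LOOSE, SAMPLES))
    print("strict:", accepted_by(STRICT, SAMPLES))
```

The repeated alternation accepts any two picks from A and B, including the same one twice, whereas the question asks for exactly one of each.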