Is it possible for a regex to match based on other parts of the same regex?

For example, how would I match lines that begin and end with the same sequence of 3 characters, regardless of what the characters are?

Matches:

abcabc
xyz abc xyz

Doesn't Match:

abc123

Undefined: (Can match or not, whichever is easiest)

ababa
a

Ideally, I'd like something in the Perl regex flavor. If that's not possible, I'd be interested to know if there are any flavors that can do it.

+8  A: 

You need backreferences. The idea is to use a capturing group for the first bit, and then refer back to it when you're trying to match the last bit. Here's an example of matching a pair of HTML start and end tags (from the link given earlier):

<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>

This regex contains only one pair of parentheses, which capture the string matched by [A-Z][A-Z0-9]* into the first backreference. This backreference is reused with \1 (backslash one). The / before it is simply the forward slash in the closing HTML tag that we are trying to match.
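
As a rough illustration, here is that pattern run in Perl against a made-up HTML snippet. The sample string and variable names are assumptions for this sketch, not taken from the linked page. The /i flag is added so that the lowercase tag names in the sample match, and m{...} is used as the delimiter so the forward slash needs no escaping.

```perl
use strict;
use warnings;

# Hypothetical sample input, invented for illustration.
my $html = 'Some <b>bold</b> and <i>mismatched</b> text';

# Group 1 captures the tag name; \1 requires the closing tag to repeat it.
my $tag = '';
if ($html =~ m{<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>}i) {
    $tag = $1;
    print "Matched element: $&\n";   # the whole matched text
    print "Tag name: $tag\n";        # what the first group captured
}
```

The mismatched `<i>...</b>` pair is not what gets matched here, because the closing tag has to repeat the name that the group captured.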

Applying this to your case:

/^(.{3}).*\1$/

(Yes, that's the regex that Brian Carper posted. There just aren't that many ways to do this.)

A detailed explanation for posterity's sake (please don't be insulted if it's beneath you):

  • ^ matches the start of the line.
  • (.{3}) grabs three characters of any type and saves them in a group for later reference.
  • .* matches any run of characters (other than a newline), as long as possible while still letting the rest of the pattern match. (You don't care what's in the middle of the line.)
  • \1 matches the exact text that the group in step 2 captured.
  • $ matches the end of the line.
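
The steps above can be sketched in Perl as a quick check against the sample lines from the question. The helper name same_ends is hypothetical, introduced only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the line starts and ends with the
# same three characters, using the pattern explained above, else 0.
sub same_ends {
    my ($line) = @_;
    return $line =~ /^(.{3}).*\1$/ ? 1 : 0;
}

# Sample lines taken from the question.
for my $line ('abcabc', 'xyz abc xyz', 'abc123', 'ababa', 'a') {
    printf "%-14s %s\n", "'$line'", same_ends($line) ? 'match' : 'no match';
}
```

With this pattern the two "undefined" cases (ababa and a) do not match, because the captured group and the backreference cannot overlap, so a matching line needs at least six characters.
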
Michael Myers
+4  A: 

Use capture groups and backreferences.

/^(.{3}).*\1$/

The \1 refers back to whatever is matched by the contents of the first capture group (the contents of the ()). Regexes in most languages allow something like this.
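
To see what the first capture group actually holds after a successful match, here is a small Perl sketch. The sample strings come from the question; the %captured hash is just an illustrative name.

```perl
use strict;
use warnings;

my %captured;
for my $line ('abcabc', 'xyz abc xyz') {
    if ($line =~ /^(.{3}).*\1$/) {
        # $1 holds the text matched by the first pair of parentheses;
        # \1 inside the pattern had to match that same text again.
        $captured{$line} = $1;
        print "'$line' begins and ends with '$1'\n";
    }
}
```

Outside the pattern the captured text is available as $1, while inside the pattern it is referenced as \1.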

Brian Carper
Huh, I've actually been using capture groups and back references for years in the replace part of find/replace. I never once thought I might be able to use them in the original match pattern too.
Whatsit
+3  A: 

For the same characters at the beginning and end:

/^(.{3}).*\1$/

The \1 is a backreference to the first capture group.

cletus
+1  A: 

This works:

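my $test = 'abcabc';
# \1 must repeat whatever the first group captured
print "match\n" if $test =~ m/^([a-z]{3}).*\1$/;

The ^ and $ anchors pin the group to the beginning of the line and the backreference to the end. Note that [a-z]{3} only accepts lowercase letters; use .{3} to allow any characters.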

Peter Stuifzand