Hi,

I've been googling & trying to get this myself but can't quite get it...

QUESTION: What regular expression could be used to select the text BETWEEN (but not including) two delimiters? For example:

Start Marker=ABC
Stop Marker=XYZ

---input---
This is the first line
And ABCfirst matched hereXYZ
and then
again ABCsecond matchXYZ
asdf
------------

---expected matches-----
[1] first matched here
[2] second match
------------------------

Thanks

+5  A: 

POSIX basic or extended regex syntax can't exclude the delimiters from the match itself, but what it can do is create match groups which you can then select. For instance:

ABC(.*)XYZ

will store anything between ABC and XYZ as \1 (otherwise known as group 1).

If you're using PCREs (Perl-Compatible Regular Expressions), lookahead and lookbehind assertions are also available -- but groups are the more portable and better-performing solution. Also, if you're using PCREs, you should use *? to ensure that the match is non-greedy and will terminate at the first opportunity.

You can test this yourself in a Python interpreter (the Python regex syntax is PCRE-derived):

>>> import re
>>> input_str = '''
... This is the first line
... And ABCfirst matched hereXYZ
... and then
... again ABCsecond matchXYZ
... asdf
... '''
>>> re.findall(r'ABC(.*?)XYZ', input_str)
['first matched here', 'second match']
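
The lookaround alternative mentioned above can be sketched the same way. This is only a minimal illustration, assuming Python's `re` module (which supports fixed-width lookbehind) and the sample input from the question; the variable names are made up for the example. With lookarounds the delimiters are excluded from the match itself, so no group is needed:

```python
import re

text = """This is the first line
And ABCfirst matched hereXYZ
and then
again ABCsecond matchXYZ
asdf
"""

# (?<=ABC) requires ABC just before the match; (?=XYZ) requires XYZ just after.
# Neither assertion consumes text, so the whole match is the text in between.
lookaround = re.compile(r'(?<=ABC).*?(?=XYZ)')
matches = lookaround.findall(text)
print(matches)
```

Because the pattern has no capture group, `findall` returns the full matches rather than group contents.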
Charles Duffy
would the \1 group contain "first matched here" and "second match", or everything between first ABC till last XYZ?
kender
@kender - To have only one match, two things would need to be true: The multiline flag would need to be set, and the asterisk would need to be greedy. Otherwise, we have two separate matches, each of which has its own groups.
Charles Duffy
I'm actually using C#, so is the idea I might be able to get at the groups (e.g. \1 group) in C#?
Greg
@Greg - Absolutely; if you have a Match m, see m.Groups.
Charles Duffy
got it thanks: foreach (Match match in matches) { GroupCollection groups = match.Groups; Console.Out.WriteLine(groups[1]); }
Greg
It isn't the Multiline flag that would need to be set, it's RegexOptions.Singleline (in Python it would be re.DOTALL or re.S).
Alan Moore
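
To illustrate the flag discussion in the comments above, here is a small Python sketch, assuming the question's sample input. With a greedy `.*`, the two marker pairs only merge into one match when `.` is allowed to cross newlines (`re.DOTALL`); with default flags each line still matches separately:

```python
import re

text = """This is the first line
And ABCfirst matched hereXYZ
and then
again ABCsecond matchXYZ
asdf
"""

# Greedy, default flags: '.' stops at newlines, so each line matches on its own.
greedy_default = re.findall(r'ABC(.*)XYZ', text)

# Greedy with DOTALL: '.' also matches newlines, so a single match spans
# from the first ABC to the last XYZ.
greedy_dotall = re.findall(r'ABC(.*)XYZ', text, re.DOTALL)

# Non-greedy with DOTALL: each match stops at the nearest XYZ again.
lazy_dotall = re.findall(r'ABC(.*?)XYZ', text, re.DOTALL)

print(greedy_default)
print(greedy_dotall)
print(lazy_dotall)
```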
+1  A: 

/ABC(.*?)XYZ/

By default, regular expression quantifiers are greedy. The '?' after the .* requests a minimal (non-greedy) match, so that the first match is this:

first matched here

...instead of this (which a greedy .* would produce when . is also allowed to match newlines):

first matched hereXYZ
and then
again ABCsecond match
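
The greedy versus minimal behaviour is easiest to see when both marker pairs sit on one line, since no newline handling is involved. A minimal Python sketch, using a made-up single-line input rather than the question's text:

```python
import re

# Hypothetical single-line input with two marker pairs.
line = "ABCfirst matched hereXYZ and ABCsecond matchXYZ"

greedy = re.findall(r'ABC(.*)XYZ', line)   # .* grabs as much as possible
lazy = re.findall(r'ABC(.*?)XYZ', line)    # .*? grabs as little as possible

print(greedy)
print(lazy)
```

The greedy pattern runs from the first ABC to the last XYZ, while the non-greedy one stops at the nearest XYZ each time.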
Sonam Chauhan
@Sonam - Depends on the regex syntax in use -- remember, we have basic, extended, and Perl-Compatible; only the last of those recognizes the question mark as modifying greedy behavior.
Charles Duffy
.? would match zero or one character -- you also need * or +
Devin Ceartas
Thanks guys. Yes, of course it should be .*? or .+?... my regex-fu is weak, and it's PCRE :)
Sonam Chauhan
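
The difference between `.?`, `.*?` and `.+?` raised in these comments can be checked quickly. This is just a sketch in Python with made-up sample strings:

```python
import re

sample = "ABCfirst matched hereXYZ"

# .? means "zero or one character"; it is not a non-greedy wildcard,
# so it cannot span the text between the markers here.
optional_char = re.findall(r'ABC(.?)XYZ', sample)

# .*? means "zero or more characters, as few as possible".
lazy_star = re.findall(r'ABC(.*?)XYZ', sample)

# .+? additionally requires at least one character between the markers,
# so it differs from .*? when the markers are adjacent.
lazy_plus_adjacent = re.findall(r'ABC(.+?)XYZ', "ABCXYZ")
lazy_star_adjacent = re.findall(r'ABC(.*?)XYZ', "ABCXYZ")

print(optional_char, lazy_star, lazy_plus_adjacent, lazy_star_adjacent)
```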
A: 

You want the non-greedy match, .*?

while( $string =~ /ABC(.*?)XYZ/gm ) {
  $match = $1;
}
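
For comparison, a rough Python counterpart of the loop above, assuming the standard `re` module and a shortened version of the question's input. `finditer` plays the role of the `/g` loop, and `m.group(1)` corresponds to `$1`:

```python
import re

text = "And ABCfirst matched hereXYZ\nagain ABCsecond matchXYZ\n"

pattern = re.compile(r'ABC(.*?)XYZ')

collected = []
for m in pattern.finditer(text):
    # m.group(1) is the captured text between the markers (Perl's $1);
    # m.start(1) is the offset where that text begins in the input.
    collected.append((m.start(1), m.group(1)))

print(collected)
```

Keeping the match objects, rather than calling `findall`, is useful when the position of each match is needed as well as its text.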
Devin Ceartas
(This is Perl. There is a reason so many languages use Perl-style regex...) ;-)
Devin Ceartas