I suspect this has already been answered somewhere, but I can't find it, so...

I need to extract the string between two tokens in a larger string, where the second token will probably appear again later in that string. In pseudocode:

myString = "A=abc;B=def_3%^123+-;C=123;";

myB = getInnerString(myString, "B=", ";");

method getInnerString(inStr, startToken, endToken){
   return inStr.replace( EXPRESSION, "$1");
}

So, when I run this using the expression ".+B=(.+);.+" I get "def_3%^123+-;C=123;", presumably because it looks for the LAST instance of ';' in the string rather than stopping at the first one it comes to.

I've tried using (?=) in search of that first ';' but it gives me the same result.

I can't seem to find a regex reference that explains how to specify the "NEXT" token rather than the one at the end.

Any and all help is greatly appreciated.



+3  A: 

Try this:

B=([^;]+);

This matches B=, then one or more characters that are not a ;, then a ;. So the group captures everything between B= and the first ; after it.
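For readers working in Java (the language the asker later says he is using), here is a minimal sketch of this pattern used with find(). The class name NegatedClassDemo and the helper extractB are made up for illustration and are not part of the original answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NegatedClassDemo {
    // Hypothetical helper: returns the text between "B=" and the next ';',
    // or null when the pattern does not match.
    static String extractB(String input) {
        // [^;]+ can never run past a semicolon, so no laziness is needed.
        Matcher m = Pattern.compile("B=([^;]+);").matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extractB("A=abc;B=def_3%^123+-;C=123;")); // def_3%^123+-
    }
}
```

Note that [^;]+ needs at least one character, so an empty value such as "B=;" yields no match; use [^;]* if empty values should be accepted.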

Gumbo
+5  A: 

Your pattern is greedy because the quantifier has no ? after it. Try this:

".+B=(.+?);.+"
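To see the difference in isolation, here is a small Java sketch that applies a greedy and a lazy group to the string from the question. It uses a bare "B=(...);" pattern with find() rather than the full replace expression, and the class and method names (GreedyVsLazyDemo, firstGroup) are invented for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsLazyDemo {
    // Hypothetical helper: returns group 1 of the first match, or null if none.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String s = "A=abc;B=def_3%^123+-;C=123;";
        // Greedy: .+ grabs as much as it can, so the group runs to the LAST ';'.
        System.out.println(firstGroup("B=(.+);", s));  // def_3%^123+-;C=123
        // Lazy: .+? grabs as little as it can, so the group stops at the NEXT ';'.
        System.out.println(firstGroup("B=(.+?);", s)); // def_3%^123+-
    }
}
```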
Evan Fosmark
Thanks! Works like a charm, though having read the description for '?', I'm not sure I see why it would produce that effect.
Dr.Dredel
Dr.Dredel, it makes the quantifier match as few characters as possible. Without it, the quantifier matches as many as possible (hence "greedy": it takes as much as it can).
Evan Fosmark
But it takes a lot of backtracking.
Gumbo
No, non-greedy quantifiers *eliminate* backtracking by doing some extra work up front.
Alan Moore
And what regex implementation does this? I ran a test in RegexBuddy with all regex flavors, and every one of them had to backtrack and needed 82 steps to find a match.
Gumbo
It's the .+ at the beginning of the regex that's causing all the backtracking. But that, and the one at the end, only need to be there because the OP is doing a 'replace' when he should be doing a 'find'.
Alan Moore
Alan M, what do you mean I *should* be doing a find? I need the rest of the stuff in the string (the start and end around the substring I need) to go away... I'm not clear on what you're saying.
Dr.Dredel
No, you shouldn't need to match those parts of the string. I'll have to post a separate answer to explain (I'll hit the backtracking thing, too).
Alan Moore
A: 

(This is a continuation of the conversation from the comments to Evan's answer.)

Here's what happens when your (corrected) regex is applied: First, the .+ matches the whole string. Then it backtracks, giving up most of the characters it just matched until it gets to the point where the B= can match. Then the (.+?) matches (and captures) everything it sees until the next part, the semicolon, can match. Then the final .+ gobbles up the remaining characters.
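One way to check this division of labor is to wrap the leading and trailing .+ in capturing groups of their own and print what each one ended up with. This is only an illustrative variation on the regex discussed above; the extra groups and the names BacktrackPartsDemo and parts are assumptions made for the sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BacktrackPartsDemo {
    // Hypothetical helper: returns what the leading .+, the lazy group, and the
    // trailing .+ each matched, or null if the regex does not match the whole input.
    static String[] parts(String input) {
        Matcher m = Pattern.compile("(.+)B=(.+?);(.+)").matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] p = parts("A=abc;B=def_3%^123+-;C=123;");
        System.out.println("leading .+  : " + p[0]); // A=abc;
        System.out.println("captured    : " + p[1]); // def_3%^123+-
        System.out.println("trailing .+ : " + p[2]); // C=123;
    }
}
```

The leading group shows how far the first .+ had to back off before B= could match, and the trailing group shows the leftover text that is matched only so that a replace can discard it.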

All you're really interested in is the "B=" and the ";" and whatever's between them, so why match the rest of the string? The only reason you have to do that is so you can replace the whole string with the contents of the capturing group. But why bother doing that if you can access the contents of the group directly? Here's a demonstration (in Java, because I can't tell what language you're using):

String s = "A=abc;B=def_3%^123+-;C=123;";

Pattern p = Pattern.compile("B=(.*?);");
Matcher m = p.matcher(s);
if (m.find())
{
  System.out.println(m.group(1));
}
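The same find-based approach can be wrapped into the getInnerString method sketched in the question. The version below is a rough sketch, not a definitive implementation: it assumes the tokens are literal text (hence Pattern.quote) and that returning null is an acceptable way to signal "no match". The class name InnerStringDemo is made up.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class InnerStringDemo {
    // Returns the text between the first startToken and the next endToken
    // after it, or null if no such pair exists.
    static String getInnerString(String inStr, String startToken, String endToken) {
        // Pattern.quote makes the tokens match literally, even if they
        // contain regex metacharacters such as '(' or '+'.
        Pattern p = Pattern.compile(
                Pattern.quote(startToken) + "(.*?)" + Pattern.quote(endToken));
        Matcher m = p.matcher(inStr);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String s = "A=abc;B=def_3%^123+-;C=123;";
        System.out.println(getInnerString(s, "B=", ";")); // def_3%^123+-
        System.out.println(getInnerString(s, "C=", ";")); // 123
    }
}
```

By default the dot does not match line terminators, so if the inner text may span several lines, compiling with Pattern.DOTALL would be needed.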

Why do a 'replace' when a 'find' is so much more straightforward? Probably because your API makes it easier; that's why we do it in Java. Java has several regex-oriented convenience methods in its String class: replaceAll(), replaceFirst(), split(), and matches() (which returns true iff the regex matches the whole string), but not find(). And there's no convenience method for accessing capturing groups, either. We can't match the elegance of Perl one-liners like this:

print $1 if 'A=abc;B=def_3%^123+-;C=123;' =~ /B=(.*?);/;

...so we content ourselves with hacks like this:

System.out.println("A=abc;B=def_3%^123+-;C=123;"
    .replaceFirst(".+B=(.*?);.+", "$1"));

Just to be clear, I'm not saying not to use these hacks, or that there's anything wrong with Evan's answer--there isn't. I just think we should understand why we use them, and what trade-offs we're making when we do.

Alan Moore
I am using Java and am infuriated by how sad its regex options are compared to Perl. Two other infuriating weaknesses are the absence of the modifiers after the last / (where you stick the i, g, etc.), so that you have to use some bizarro .IGNORE_CASE constant instead, and the obscenely
Dr.Dredel
ugly necessity to escape all your '\' with additional \, making the regex (already difficult to examine with human eyes) MUCH harder to look at. Not to mention that if you need to run a string through multiple regexes, there's a good chance that the resulting string will lose one level of '\'.
Dr.Dredel
I admit that I'm brand new to using regex in Java, but I note in your comment the reference to its lack of elegance compared with Perl (with which I am familiar), and I am inclined to agree completely. Lastly,
Dr.Dredel
from a code design standpoint, the original example offered by Evan is a lot prettier, albeit more wasteful cycle-wise.
Dr.Dredel
The transition from Perl to Java is bound to be painful anyway, Java being so much more rigid and verbose. Just try to accept it on its own terms. As for the modifiers, I almost never use IGNORE_CASE and such; just stick (?i) at the beginning of the regex itself.
Alan Moore
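As a rough illustration of the (?i) suggestion: in Java the flag constant is Pattern.CASE_INSENSITIVE, and the inline (?i) modifier has the same effect without a second argument. The sketch below compares the two on the string from the question; the class and method names are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class InlineFlagDemo {
    // Case-insensitive match using the inline (?i) modifier.
    static String withInlineFlag(String input) {
        Matcher m = Pattern.compile("(?i)b=(.*?);").matcher(input);
        return m.find() ? m.group(1) : null;
    }

    // The same match using the flag constant passed to Pattern.compile.
    static String withConstant(String input) {
        Matcher m = Pattern.compile("b=(.*?);", Pattern.CASE_INSENSITIVE).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String s = "A=abc;B=def_3%^123+-;C=123;";
        System.out.println(withInlineFlag(s)); // def_3%^123+-
        System.out.println(withConstant(s));   // def_3%^123+-
    }
}
```

Because the inline form lives inside the regex string itself, it also works with the String convenience methods such as replaceFirst(), which take no flags argument.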