tags:

views:

290

answers:

4

Say, I have a line that contains the following string:

"$tom" said blah blah blash.  "$dick" said "blah blah blah". "$harry" said blah blah blah.

and I want to extract

"$dick" said "blah blah blah"

I have the following code:

my ($term) = /(".+?" said ".+?")/g;
print $term;

But it gives me more than I need:

"$tom" said blah blah blash.  "$dick" said "blah blah blah"

I tried grouping my pattern as a whole by using the non-capturing parens:

my ($term) = /((?:".+?" said ".+?"))/g;

But the problem persists.

I've reread the Nongreedy Quantifiers section of Learning Perl but it's got me nowhere so far.

Thanks for any guidance you can generously offer :)

+17  A: 

The problem is that, even though it's not greedy, it still keeps trying. The regex doesn't see

"$tom" said blah blah blash.

and think "Oh, the stuff following the "said" isn't quoted, so I'll skip that one." It thinks "well, the stuff after "said" isn't quoted, so it must still be part of our quote." So ".+?" matches

"$tom" said blah blah blash.  "$dick"

What you want is "[^"]+". This will match two quote marks enclosing anything that's not a quote mark. So the final solution:

("[^"]+" said "[^"]+")
Chris Lutz
Thanks Chris, for the solution and the explanation :) Thanks a lot for the englightenment!
Mike
++! Another good reason to use `[^"]+` is that it will cut down on unnecessary back-tracking and make your regex more efficient.
daotoad
@daotoad - I discussed this with Alex Martelli in his answer. This is often cited, but even for 10,000,000 comparisons the performance difference is still almost unnoticeable, both in Perl and in Python. It seems the mantra changes from "the compiler will optimize that" to "the regex engine will optimize that." :P (See my comment to his answer for the results of my timings.)
Chris Lutz
@Chris Lutz: I think daotoad was comparing `"[^"]+"` to `".*+"`, not to `"[^"]+?"`.
ysth
@Chris,@daotaod,@ysth, I have another question. I changed the string to ""$tom" said blah blah blash. "$dick" said "blah blah blah". "$harry" said blah blah blah. "$jane" said "blah blah blah"." Then I tried "print $term" but I only get "$dick" said "blah blah blah"? Why ""$jane" said "blah blah blah" is not captured? Thanks!
Mike
I already used the g modifier: "my ($term) = /("[^"]+" said "[^"]+")/g; print $term \n;" but things are not right. What am I doing wrong?
Mike
Okiedokie, the following code worked fine but I still want to know why "print $term" only prints the first capture : "my @group = ($_=~ m/("[^"]+" said "[^"]+")/g); print "@group \n";
Mike
Well, I tried the following code and it worked. But it's quite inconvenient : my ($first,$second) = /("[^"]+" said "[^"]+")/g; print $first;print $second;
Mike
Because `my ($term) = something` assigns the return of `something` (as a list) to the list `($term)` and since that list only has one item to assign to, Perl silently discards the excess. The correct way to capture an arbitrary number of matches with the `/g` modifier is to use a list like `@group` so that it always has enough room. You can also use the `/g` modifier in a `while` loop to iterate through matches, and then perform an action for each. Or you can use `map` for the same goal: `map { do stuff } (/regex/g);`
Chris Lutz
@Chris, thanks for the clarification. I already figured out how to use @group to capture more than one match but just didn't understand why my ($term) construct failed. Now I see the point. It's different from $_=~ construct. Thanks :)
Mike
+3  A: 

Unfortunately " is a peculiar-enough character to need to be treated carefully. Use:

my ($term) = /("[^"]+?" said "[^"]+?")/g;

and it should work fine (it does for me...!). I.e. explicitly match sequences of "nondoublequotes" rather than sequences of arbitrary characters.

Alex Martelli
Woah! The Lord of Python answers a Perl question! Pinch me, I'm dreaming. (As a note, the non-greedy match isn't necessary anymore, and will probably just slow the regex engine down. But this does work, as you say.)
Chris Lutz
I was an excellent Perl'er back in Perl 4 time (and this question's quite germane to Perl 4;-) -- it's the weird novelties of Perl 5 that threw me off and eventually made a Pythonista out of me!-) (Back in "Perl 4" times, non-greedy was never less efficient than greedy -- haven't looked at any _current_ Perl RE engines to see if they've spoiled THOSE, too, though!-).
Alex Martelli
Grazie, Alex :)
Mike
There's no extreme difference, but after testing I've found that the greedy `"[^"]+"` is _slightly_ faster than `"[^"]+?"`. It's not a significant difference: 7.13 seconds for the greedy version, 7.68 for the non-greedy (for 20,000,000 tests). It's doubtful that this insignificant of a difference will ever affect real-life code, but it is good to note. (For completeness, in Python the greedy one takes 15.15 seconds and the non-greedy one takes 16.47 seconds. Same data set.)
Chris Lutz
+2  A: 

Others have mentioned how to fix this.

I'll answer how you can debug this: you can see what's happening by using more captures:

 bash$ cat story | perl -nle 'my ($term1, $term2, $term3) = /(".+?") (said) (".+?")/g ; 
      print "term1 = \"$term1\" term2 = \"$term2\" term3 = \"$term3\" \n"; '
 term1 = ""$tom" said blah blah blash.  "$dick"" term2 = "said" term3 = ""blah blah blah""
Stobor
+1  A: 

Your problem here is that there are two possible matches for your regexp, the one you want (a shorter one) and the one the regex engine chooses. The engine chooses that specific match because it prefers a match that starts earlier in the string and is longer to a match that starts later and is shorter. In other words, early matches win over shorter ones.

To solve this you need to make your regex more specific (as in telling the engine that $term should not contain any quotes. It's a good idea to make your regexes as specific as possible anyway.

For more details and gotchas regarding regular expressions, I recommend Jeffrey Friedl's excellent book: Mastering Regular Expressions

kixx
@kixx, thanks for the explanation and the book recommendation.
Mike