ansaurus

Question

Why does my non-greedy Perl regex still match too much?

Answer 1

+17 A:

The problem is that, even though it's not greedy, it still keeps trying. The regex doesn't see

"$tom" said blah blah blash.

and think "Oh, the stuff following the "said" isn't quoted, so I'll skip that one." It thinks "well, the stuff after "said" isn't quoted, so it must still be part of our quote." So ".+?" matches

"$tom" said blah blah blash.  "$dick"

What you want is "[^"]+". This will match two quote marks enclosing anything that's not a quote mark. So the final solution:

("[^"]+" said "[^"]+")

Chris Lutz 2009-10-21 05:57:21

Thanks Chris, for the solution and the explanation :) Thanks a lot for the englightenment!

Mike 2009-10-21 06:05:02

++! Another good reason to use `[^"]+` is that it will cut down on unnecessary back-tracking and make your regex more efficient.

daotoad 2009-10-21 06:18:18

@daotoad - I discussed this with Alex Martelli in his answer. This is often cited, but even for 10,000,000 comparisons the performance difference is still almost unnoticeable, both in Perl and in Python. It seems the mantra changes from "the compiler will optimize that" to "the regex engine will optimize that." :P (See my comment to his answer for the results of my timings.)

Chris Lutz 2009-10-21 06:31:40

@Chris Lutz: I think daotoad was comparing `"[^"]+"` to `".*+"`, not to `"[^"]+?"`.

ysth 2009-10-21 09:06:41

@Chris,@daotaod,@ysth, I have another question. I changed the string to ""$tom" said blah blah blash. "$dick" said "blah blah blah". "$harry" said blah blah blah. "$jane" said "blah blah blah"." Then I tried "print $term" but I only get "$dick" said "blah blah blah"? Why ""$jane" said "blah blah blah" is not captured? Thanks!

Mike 2009-10-21 09:38:09

I already used the g modifier: "my ($term) = /("[^"]+" said "[^"]+")/g; print $term \n;" but things are not right. What am I doing wrong?

Mike 2009-10-21 09:41:00

Okiedokie, the following code worked fine but I still want to know why "print $term" only prints the first capture : "my @group = ($_=~ m/("[^"]+" said "[^"]+")/g); print "@group \n";

Mike 2009-10-21 10:15:07

Well, I tried the following code and it worked. But it's quite inconvenient : my ($first,$second) = /("[^"]+" said "[^"]+")/g; print $first;print $second;

Mike 2009-10-21 10:23:32

Because `my ($term) = something` assigns the return of `something` (as a list) to the list `($term)` and since that list only has one item to assign to, Perl silently discards the excess. The correct way to capture an arbitrary number of matches with the `/g` modifier is to use a list like `@group` so that it always has enough room. You can also use the `/g` modifier in a `while` loop to iterate through matches, and then perform an action for each. Or you can use `map` for the same goal: `map { do stuff } (/regex/g);`

Chris Lutz 2009-10-21 21:31:34

@Chris, thanks for the clarification. I already figured out how to use @group to capture more than one match but just didn't understand why my ($term) construct failed. Now I see the point. It's different from $_=~ construct. Thanks :)

Mike 2009-10-22 00:54:11

Answer 2

+3 A:

Unfortunately " is a peculiar-enough character to need to be treated carefully. Use:

my ($term) = /("[^"]+?" said "[^"]+?")/g;

and it should work fine (it does for me...!). I.e. explicitly match sequences of "nondoublequotes" rather than sequences of arbitrary characters.

Alex Martelli 2009-10-21 06:00:16

Woah! The Lord of Python answers a Perl question! Pinch me, I'm dreaming. (As a note, the non-greedy match isn't necessary anymore, and will probably just slow the regex engine down. But this does work, as you say.)

Chris Lutz 2009-10-21 06:02:08

I was an excellent Perl'er back in Perl 4 time (and this question's quite germane to Perl 4;-) -- it's the weird novelties of Perl 5 that threw me off and eventually made a Pythonista out of me!-) (Back in "Perl 4" times, non-greedy was never less efficient than greedy -- haven't looked at any _current_ Perl RE engines to see if they've spoiled THOSE, too, though!-).

Alex Martelli 2009-10-21 06:07:07

Grazie, Alex :)

Mike 2009-10-21 06:14:45

There's no extreme difference, but after testing I've found that the greedy `"[^"]+"` is _slightly_ faster than `"[^"]+?"`. It's not a significant difference: 7.13 seconds for the greedy version, 7.68 for the non-greedy (for 20,000,000 tests). It's doubtful that this insignificant of a difference will ever affect real-life code, but it is good to note. (For completeness, in Python the greedy one takes 15.15 seconds and the non-greedy one takes 16.47 seconds. Same data set.)

Chris Lutz 2009-10-21 06:29:31

Answer 3

+2 A:

Others have mentioned how to fix this.

I'll answer how you can debug this: you can see what's happening by using more captures:

 bash$ cat story | perl -nle 'my ($term1, $term2, $term3) = /(".+?") (said) (".+?")/g ; 
      print "term1 = \"$term1\" term2 = \"$term2\" term3 = \"$term3\" \n"; '
 term1 = ""$tom" said blah blah blash.  "$dick"" term2 = "said" term3 = ""blah blah blah""

Stobor 2009-10-21 06:01:04

Answer 4

+1 A:

Your problem here is that there are two possible matches for your regexp, the one you want (a shorter one) and the one the regex engine chooses. The engine chooses that specific match because it prefers a match that starts earlier in the string and is longer to a match that starts later and is shorter. In other words, early matches win over shorter ones.

To solve this you need to make your regex more specific (as in telling the engine that $term should not contain any quotes. It's a good idea to make your regexes as specific as possible anyway.

For more details and gotchas regarding regular expressions, I recommend Jeffrey Friedl's excellent book: Mastering Regular Expressions

kixx 2009-10-25 21:27:27

@kixx, thanks for the explanation and the book recommendation.

Mike 2009-10-26 00:44:25

ansaurus

tags:

views:

answers:

Why does my non-greedy Perl regex still match too much?

related questions