I've got the following regex, which was working perfectly until a new situation arose:

^.*[?&]U(?:RL)?=(?<URL>.*)$

Basically, it's run against URLs to grab EVERYTHING after U= or URL= and return it in the URL group.

So, for the following

http://localhost?a=b&u=http://otherhost?foo=bar

URL = http://otherhost?foo=bar

Unfortunately, an odd case came up:

http://localhost?a=b&u=http://otherhost?foo=bar&url=http://someotherhost

Ideally, I want URL to be "http://otherhost?foo=bar&url=http://someotherhost", but instead it is just "http://someotherhost".
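
For anyone who wants to reproduce this, here is a minimal sketch in Python. Two things in it are assumptions rather than part of the original setup: Python's re module spells named groups (?P<URL>...) instead of (?<URL>...), and the match is made case-insensitive because the sample URLs use lowercase u= and url=. The helper name extract_greedy is made up for illustration.

```python
import re

# Original pattern, adapted to Python's (?P<name>...) named-group syntax.
# IGNORECASE is an assumption: the sample URLs use lowercase "u=" / "url=".
GREEDY = re.compile(r"^.*[?&]U(?:RL)?=(?P<URL>.*)$", re.IGNORECASE)

def extract_greedy(url):
    """Return the URL group captured by the original pattern, or None."""
    m = GREEDY.match(url)
    return m.group("URL") if m else None

simple = "http://localhost?a=b&u=http://otherhost?foo=bar"
nested = "http://localhost?a=b&u=http://otherhost?foo=bar&url=http://someotherhost"

print(extract_greedy(simple))  # http://otherhost?foo=bar
print(extract_greedy(nested))  # http://someotherhost
```

The second call shows the unwanted result: the capture starts after the last url= rather than the first u=.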

EDIT: I think this fixed it...though it's not pretty

^.*[?&](?<![?&]U(?:RL)?=.*)U(?:RL)?=(?<URL>.*)$
+6  A: 

The issue

The problem is not that .* is not being greedy enough; it's that the other .* that appears earlier is also greedy.

To illustrate the issue, let's consider a different example. Consider the following two patterns; they're identical, except that \1 is reluctant in the second pattern:

              \1 greedy, \2 greedy         \1 reluctant, \2 greedy
              ^([0-5]*)([5-9]*)$           ^([0-5]*?)([5-9]*)$

Here we have two capturing groups. \1 captures [0-5]*, and \2 captures [5-9]*. Here's a side-by-side comparison of what these patterns match and capture:

              \1 greedy, \2 greedy          \1 reluctant, \2 greedy
              ^([0-5]*)([5-9]*)$            ^([0-5]*?)([5-9]*)$
Input         Group 1    Group 2            Group 1    Group 2
54321098765   543210     98765              543210     98765
007           00         7                  00         7
0123456789    012345     6789               01234      56789
0506          050        6                  050        6
555           555        <empty>            <empty>    555
5550555       5550555    <empty>            5550       555

Note that as greedy as \2 is, it can only grab what \1 didn't already grab first! Thus, if you want \2 to grab as many 5s as possible, you have to make \1 reluctant, so that the 5s are actually up for grabs by \2.
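
If you want to check a few rows of that table yourself, here is a small sketch in Python (any engine with reluctant quantifiers should behave the same way on these patterns). The helper name captures is made up for illustration.

```python
import re

BOTH_GREEDY = re.compile(r"^([0-5]*)([5-9]*)$")       # \1 greedy, \2 greedy
FIRST_RELUCTANT = re.compile(r"^([0-5]*?)([5-9]*)$")  # \1 reluctant, \2 greedy

def captures(pattern, text):
    """Return (group 1, group 2) for a match, or None if there is no match."""
    m = pattern.match(text)
    return m.groups() if m else None

for text in ["0123456789", "555", "5550555"]:
    print(text, captures(BOTH_GREEDY, text), captures(FIRST_RELUCTANT, text))
# 0123456789 ('012345', '6789') ('01234', '56789')
# 555 ('555', '') ('', '555')
# 5550555 ('5550555', '') ('5550', '555')
```

In the last row the reluctant \1 still has to take "5550", because the 0 can never be matched by [5-9]; reluctance only gives up what the rest of the pattern is able to take.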

The fix

So, applying this to your problem, there are two ways to fix it. You can make the first .* reluctant (see on rubular.com):

^.*?[?&]U(?:RL)?=(?<URL>.*)$

Alternatively, you can just get rid of the prefix-matching part altogether (see on rubular.com):

[?&]U(?:RL)?=(?<URL>.*)$
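
Here is a quick sketch of both fixes in Python, run against the problem URL from the question. As assumptions: named groups are written (?P<URL>...) because that is Python's syntax, the match is case-insensitive because the sample URL uses lowercase u=, and the helper name extract is made up for illustration. Note that the unanchored pattern has to be used with a search, not an anchored match.

```python
import re

# Fix 1: make the prefix reluctant so it stops at the first U= / URL=.
RELUCTANT_PREFIX = re.compile(r"^.*?[?&]U(?:RL)?=(?P<URL>.*)$", re.IGNORECASE)
# Fix 2: drop the prefix entirely and let the search find the leftmost match.
NO_PREFIX = re.compile(r"[?&]U(?:RL)?=(?P<URL>.*)$", re.IGNORECASE)

def extract(pattern, url):
    """Return the URL group from the first match found, or None."""
    m = pattern.search(url)
    return m.group("URL") if m else None

nested = "http://localhost?a=b&u=http://otherhost?foo=bar&url=http://someotherhost"

print(extract(RELUCTANT_PREFIX, nested))
print(extract(NO_PREFIX, nested))
# both print: http://otherhost?foo=bar&url=http://someotherhost
```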
polygenelubricants
Beat me to it :-)
Nate
Reluctant? I usually call it non-greedy. So indeed, the first `.*` is too greedy.
MvanGeest
Or just remove that `^.*`.
KennyTM
@MvanGeest: I learned regex through Java, and that's what they call it (http://java.sun.com/docs/books/tutorial/essential/regex/quant.html).
polygenelubricants
@polygenelubricants: OK, I met regexes in Perl, and Learning Perl uses non-greedy. It's also in a very old Perl FAQ: http://www.perl.com/doc/FAQs/FAQ/oldfaq-html/Q1.3.html But yeah, who cares what they're called? (A person searching the web desperately, of course...)
MvanGeest
Nice edit and further explanation, I'd upvote and accept this answer again if I could
Chad