I'm trying to write a regular expression that surrounds "http" URLs with angle brackets, except for lines beginning with two slashes. The best I've come up with is:

s#^(?!//)(.*?)(http://[^\s]+)#$1<$2>#gm;

This works great for these two:


Input: http://a.com

Output: <http://a.com>


Input: //http://a.com

Output: //http://a.com


However, it fails here:


Input: http://a.com http://b.com

Actual Output: <http://a.com> http://b.com

Desired Output: <http://a.com> <http://b.com>


Why doesn't my regular expression keep matching? Am I using /g wrong?

+4  A: 

You should really use two regexes; one to identify the "commented-out" lines and one to modify the http's in the regular lines.

There might be a non-standard way to combine the two regexes or replace all of your multiple (http...)+ matches, but I wouldn't use them.
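A minimal sketch of the two-regex idea, assuming the text can be processed line by line. The helper name `bracket_urls` is hypothetical and only for illustration. One regex skips the "commented-out" lines, the other wraps every URL on the remaining lines:

```perl
use strict;
use warnings;

# Hypothetical helper: wrap http:// URLs in angle brackets,
# leaving lines that begin with "//" untouched.
sub bracket_urls {
    my ($text) = @_;
    my @lines = split /^/m, $text;          # split into lines, keeping newlines
    for my $line (@lines) {
        next if $line =~ m{^//};            # regex 1: skip "commented-out" lines
        $line =~ s{(http://\S+)}{<$1>}g;    # regex 2: wrap every URL on the line
    }
    return join '', @lines;
}

print bracket_urls("http://a.com http://b.com\n//http://c.com\n");
# prints:
# <http://a.com> <http://b.com>
# //http://c.com
```

Because the second regex has no `^` anchor, `/g` is free to find every URL on the line.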

aib
The regex is fed into a legacy function that operates on a big, multi-line blob of text. I wish I could split it into lines and do what you say, but that would require major regression testing.
mike
Major refactoring and regression testing, I should say.
mike
@Mike - if you need to match the beginning of multiple lines, consider the 'm' modifier. It causes ^ and $ to match the beginning or end of any line.
Chris Lutz
Oh, in practice I do -- somehow that got wiped out when I was turning it into an SO question.
mike
Ah, the joy of working with legacy code :)
aib
+3  A: 

You can't really do this for an indefinite number of expressions. Try this:

s#(http://[^\s]+)#<$1>#g unless m#^//#;

This will replace all of the URLs in the line, but only if the first two characters of the line aren't "//". Sure, it's a little more complicated, but it works (I think).
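If the text arrives as one multi-line blob rather than one line at a time, the same check can be folded into a single substitution with the /e modifier. This is only a sketch under that assumption: the name `bracket_urls_in_blob` is made up for the example, and it only helps if the calling code can evaluate a replacement expression.

```perl
use strict;
use warnings;

# Hypothetical helper: the outer s///gme visits each line that does not
# start with "//"; the inner s///g wraps every URL found on that line.
sub bracket_urls_in_blob {
    my ($text) = @_;
    $text =~ s{^(?!//)(.*)$}{
        my $line = $1;                      # copy before the inner match resets $1
        $line =~ s{(http://\S+)}{<$1>}g;    # wrap all URLs on this line
        $line;                              # value of the block is the replacement
    }gme;
    return $text;
}

print bracket_urls_in_blob("http://a.com http://b.com\n//http://c.com\nfoo http://d.com\n");
# prints:
# <http://a.com> <http://b.com>
# //http://c.com
# foo <http://d.com>
```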

EDIT: My answer is the same as aib's, but I have code.

Chris Lutz
+3  A: 

Rewriting it a little, with my suggestions and using the extended-whitespace (/x) modifier so it's actually readable. :)

s{
    (?:^|\G)     # start of a line, or where the last match ended (non-capturing)
    (?!//)       # not immediately followed by //
    (.*?)        # then as little as possible of anything
    (
        http://  # with http://
        [^\s]+   # and non-spaces - you could also use \S
    )
 }
 {$1<$2>}xmg;

Trying this in Perl, we get:

sub test {
    my ($str, $expect) = @_;
    my $mod = $str;
    $mod =~ s{
            (?:^|\G)     # start of a line, or where the last match ended
            (?!//)       # not immediately followed by //
            (.*?)        # then as little as possible of anything
            (
                http://  # with http://
                [^\s]+   # and non-spaces - you could also use \S
            )
          }
          {$1<$2>}xmg;
    print "Expecting '$expect' got '$mod' - ";
    print $mod eq $expect ? "passed\n" : "failed\n";
}

test("http://foo.com",    "<http://foo.com>");
test("// http://foo.com", "// http://foo.com");
test("foo\nhttp://a.com", "foo\n<http://a.com>");

# output is
# Expecting '<http://foo.com>' got '<http://foo.com>' - passed
# Expecting '// http://foo.com' got '// http://foo.com' - passed
# Expecting 'foo
# <http://a.com>' got 'foo
# <http://a.com>' - passed

Edit: A couple of changes: added the 'm' modifier to make sure that it matches from the start of each line, and changed \G to (?:^|\G) so that it also starts looking at the start of a line.
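To see why the original pattern stops after the first URL, here is a small side-by-side sketch (the variable names are just for illustration). With only `^` at the front, each match has to begin at the start of a line, so /g can find at most one match per line. Adding `\G` as an alternative lets the next match begin where the previous one ended:

```perl
use strict;
use warnings;

my $input = "http://a.com http://b.com";

# Anchored with ^ only: at most one match per line.
(my $anchored = $input) =~ s{^(?!//)(.*?)(http://\S+)}{$1<$2>}gm;

# Anchored with ^ or \G: later matches resume where the last one ended.
(my $resumed = $input) =~ s{(?:^|\G)(?!//)(.*?)(http://\S+)}{$1<$2>}gm;

print "$anchored\n";   # <http://a.com> http://b.com
print "$resumed\n";    # <http://a.com> <http://b.com>
```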

Robert P
That's really really good, and I might be able to figure out the last little problem on my own, but of course any input is appreciated: In practice it also has a /m modifier, since it operates on a big blob of text. This causes it to fail on "foo\nhttp://a.com"
mike
...which should return "foo\n<http://a.com>" but actually returns "foo\nhttp://a.com"
mike
In fact, I'm going to accept your answer anyway, since it's perfect for the question as originally asked.
mike
ok, sure. I'll update the answer.
Robert P
Hey, changing your \G to (^|\G) and your $1<$2> to $2<$3> seems to work!
mike
ah yeah :) Figured that one out right as I was updating the question... give me a bit and I'll add it. :)
Robert P
Also, I made the first group a non-capturing group. That way it's clear to others that you really don't care what the first part is.
Robert P