I was reading this question about how to parse URLs out of web pages and had a question about the accepted answer which offered this solution:

((mailto\:|(news|(ht|f)tp(s?))\://){1}\S+)

The solution was offered by csmba and he credited it to regexlib.com. Whew. Credits done.

I think this is a fairly naive regular expression, but it's a fine starting point for building something better. My question is this:

What is the point of "{1}"? It means "exactly one of the previous grouping", right? Isn't that the default behavior of a grouping in a regular expression? Would the expression be changed in any way if the {1} were removed?

If I saw this from a coworker I would point out his or her error but as I write this the response is rated at a 6 and the expression on regexlib.com is rated a 4 of 5. So maybe I'm missing something?

+1  A: 

I don't think it has any purpose. But because RegEx is almost impossible to understand/decompose, people rarely point out errors. That is probably why no one else pointed it out.

Edit: Why am I downvoted for not being wrong?

Marius
+3  A: 

@Rob: I disagree. To enforce what you are asking for, I think you would need to use a negative look-behind, which is possible but is certainly not related to the use of {1}. Neither version of the regexp addresses that particular issue.

To let the code speak:

tibook 0 /home/jj33/swap > cat text
Text this is http://example.com text this is
Text this is http://http://example.com text this is
tibook 0 /home/jj33/swap > cat p
#!/usr/bin/perl

my $re1 = '((mailto\:|(news|(ht|f)tp(s?))\://){1}\S+)';
my $re2 = '((mailto\:|(news|(ht|f)tp(s?))\://)\S+)';

while (<>) {
  print "Evaluating: $_";
  print "re1 saw \$1 = $1\n" if (/$re1/);
  print "re2 saw \$1 = $1\n" if (/$re2/);
}
tibook 0 /home/jj33/swap > cat text | perl p
Evaluating: Text this is http://example.com text this is
re1 saw $1 = http://example.com
re2 saw $1 = http://example.com
Evaluating: Text this is http://http://example.com text this is
re1 saw $1 = http://http://example.com
re2 saw $1 = http://http://example.com
tibook 0 /home/jj33/swap >

So, if there is a difference between the two versions, it doesn't seem to be the one you suggest.

jj33
+1  A: 

@Jeff Atwood, your interpretation is a little off: the {1} means "match exactly once", but it has no effect on the capturing. The capturing occurs because of the parens; the braces only specify the number of times the pattern must match the source, which is once, as you say.

I agree with @Marius, even if his answer is a little terse and may come off as being flippant. Regular expressions are tough if one's not used to using them, and the {1} in the question isn't quite an error: in systems that support it, it does mean "exactly one match". In this sense, it doesn't really do anything.
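A quick way to see this is to compare what the two patterns capture. The following is only a sketch in Python; the variable names and the sample string are assumptions made for illustration, not part of the question.

```python
import re

# The pattern from the question, with and without the {1} quantifier.
WITH_QUANTIFIER = r'((mailto\:|(news|(ht|f)tp(s?))\://){1}\S+)'
WITHOUT_QUANTIFIER = r'((mailto\:|(news|(ht|f)tp(s?))\://)\S+)'

sample = "Text this is http://example.com text this is"  # assumed sample input

m1 = re.search(WITH_QUANTIFIER, sample)
m2 = re.search(WITHOUT_QUANTIFIER, sample)

# The parentheses do the capturing; {1} only says "repeat this group once".
print(m1.groups())
print(m2.groups())
print(m1.groups() == m2.groups())
```

Both searches should report the same five groups, which is what you would expect if {1} is a no-op here.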

Unfortunately, contrary to a now-deleted post, it doesn't keep the regexp from matching http://http://example.org, since the \S+ at the end will match one or more non-whitespace characters, including the http://example.org in http://http://example.org (verified using Python 2.5, just in case my regexp reading was off). So, the regexp given isn't really the best. I'm not a URL expert, but probably something limiting the appearance of ":"s and "//"s after the first one would be necessary (but hardly sufficient) to ensure good URLs.
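A sketch of that check in current Python, together with one possible tightening. The STRICTER pattern and its negative lookahead are a hypothetical illustration of "limit what follows the first scheme"; it is not from regexlib.com and it is far from a complete URL validator.

```python
import re

# The pattern from the question, unchanged.
ORIGINAL = r'((mailto\:|(news|(ht|f)tp(s?))\://){1}\S+)'

# Hypothetical tightening: after the first scheme, refuse to match if the
# rest of the non-whitespace run contains "://" again.
STRICTER = r'((mailto\:|(news|(ht|f)tp(s?))\://)(?!\S*\://)\S+)'

doubled = "http://http://example.org"

# The original happily swallows the doubled scheme.
print(re.search(ORIGINAL, doubled).group(1))

# With search(), the stricter pattern skips ahead to the second http://.
print(re.search(STRICTER, doubled).group(1))

# As a whole-string check, the doubled URL is rejected (prints None).
print(re.fullmatch(STRICTER, doubled))
```

Note that a plain search still finds a URL inside the doubled string; the lookahead only helps as a validity check when the whole candidate is matched at once.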

Blair Conrad
+1  A: 

I don't think the {1} has any valid function in that regex.

(mailto\:|(news|(ht|f)tp(s?))\://){1}

You should read this as: "capture the stuff in the parens exactly one time". But we don't really care about capturing this for use later, e.g. as $1 in a replacement. So it's pointless.
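If the inner captures really aren't wanted, many regex flavors let you say so with non-capturing groups. A small Python sketch; the rewritten pattern, its name, and the sample string are illustrative assumptions, not something from regexlib.com.

```python
import re

# Same alternatives as the original pattern, but the inner groups use (?:...)
# so that only the whole URL is captured, and the {1} is dropped.
NON_CAPTURING = r'((?:mailto\:|(?:news|(?:ht|f)tp(?:s?))\://)\S+)'

m = re.search(NON_CAPTURING, "see ftp://example.com/file for details")

# Only one group remains: the full matched URL.
print(m.groups())
```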

Jeff Atwood