Not sure if I'm doing this right:

/(https?:\/\/\S+)(\s|(& nbsp;))?/g;

This should match a URL beginning with http(s):// and ending with a space character or a & nbsp;

So the problem is this part:

(\s|(& nbsp;))?

That should mean: match either a white space or a & nbsp; but it doesn't work. It never matches for a & nbsp; and just continues until it finds a white space.

I'm not looking for any other http regexp, I'm not looking for a javascript library solution, I'm happy with this, I just want to figure out that last portion.

Edit: some kind of bug in the code formatting on this site, there isn't a space between & and nbsp; but this site turns it into a space if I get rid of that separating space.

+6  A: 

The \S+ bit is greedy, and will match as many non-space characters as possible, including any &nbsp; that might be there. Change it to its non-greedy equivalent, \S+?, and you'll probably have better luck:

/(https?:\/\/\S+?)(\s|&nbsp;|$)/g;

(Updated because I overlooked the trailing ?.)
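As a rough check of the idea, here is a minimal sketch that runs the lazy pattern with the g flag over a made-up input. The sample string and the variable names are assumptions for illustration, and the entity is treated as the literal text "&nbsp;", as it would appear in innerHTML.

```javascript
// Hypothetical input: two URLs, the first ended by the literal text "&nbsp;",
// the second ended by an ordinary space.
var text = "http://a.example&nbsp;http://b.example end";

// Lazy \S+? grows one character at a time and stops as soon as
// whitespace, "&nbsp;", or the end of the string can follow.
var pattern = /(https?:\/\/\S+?)(\s|&nbsp;|$)/g;

var urls = [];
var m;
while ((m = pattern.exec(text)) !== null) {
  urls.push(m[1]); // group 1 is the URL without its terminator
}

console.log(urls); // [ 'http://a.example', 'http://b.example' ]
```

Because of the g flag, each exec call resumes after the previous terminator, so both URLs are picked up separately.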

molf
No that just matches only the first non word character now.
apphacker
I mean non-whitespace character, so it matches http://w and not http://www.cnn.com, for example.
apphacker
@apphacker, OK, that made sense. I overlooked the trailing question mark. I presume you placed it there because you also want to match these URLs at the end of the string? The easiest way out of that is to match either whitespace, &nbsp;, or the end of the string.
molf
ah sweet, thank you.
apphacker
A: 

The problem is the ? quantifier in (\s|(&nbsp;))?. Remember that * and ? always succeed because they're allowed to match nothing. In your case, \S+ was gobbling up the non-whitespace, and the optional alternative was matching vacuously.
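A small sketch of that vacuous match, using a made-up input with no whitespace after the URL (the sample string and variable names are assumptions, and the entity is written as the literal text "&nbsp;"):

```javascript
// The question's pattern, with the optional trailing group.
var optional = /(https?:\/\/\S+)(\s|(&nbsp;))?/;

// Hypothetical input: nothing but non-whitespace after the scheme.
var m = optional.exec("http://a.example&nbsp;tail");

// \S+ runs to the end of the string, and the optional group then
// succeeds by matching nothing, so the engine never has to backtrack.
console.log(m[1]); // http://a.example&nbsp;tail
console.log(m[2]); // undefined
```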

Change your pattern to /(https?:\/\/\S+)(\s|&nbsp;)/.

As for getting &nbsp; to appear in your question, you can use "&amp;nbsp;" in plain text or an ampersand followed by "nbsp;" in backticks.

Yes, \S+ is greedy, but the regular expression engine will backtrack to find a match, which forces the earlier subexpression to give back part of what it matched.
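To make the backtracking concrete, here is a hedged sketch with two made-up inputs (the strings and variable names are assumptions for illustration). The first shows the greedy \S+ giving characters back. The second shows the case raised in the comments under this answer, where whitespace is available later in the string and no backtracking is needed.

```javascript
// Same pattern as suggested above, with the terminator required.
var required = /(https?:\/\/\S+)(\s|&nbsp;)/;

// Case 1: no whitespace anywhere. \S+ first runs to the end, the
// required group fails there, and \S+ backs off until "&nbsp;" can match.
var single = required.exec("http://a.example&nbsp;tail");
console.log(single[1]); // http://a.example
console.log(single[2]); // &nbsp;

// Case 2: a space follows the second URL. \S+ stops at that space and
// \s matches right away, so both URLs end up inside group 1.
var pair = required.exec("http://a.example&nbsp;http://b.example tail");
console.log(pair[1]); // http://a.example&nbsp;http://b.example
console.log(pair[2] === " "); // true
```

So the greedy version does stop at the entity when nothing else can satisfy the final group, but it prefers the longest run of non-whitespace whenever real whitespace follows.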

The code below shows that the pattern above is not matching the non-breaking space entity, and produces the following results, where the strings inside the brackets are what the parenthesized subpatterns matched:

Code (but remove the space between the ampersand and "nbsp;"):

function testMatches() {
  var lis = document.getElementsByTagName("li");
  for (var i = 0; i < lis.length; i++) {  // declare i so it does not leak as a global
    var url = lis[i].innerHTML;
    var pattern = /(https?:\/\/\S+)(\s|& nbsp;)/;  // FIXME!
    var m = pattern.exec(url);
    if (m) {
      lis[i].innerHTML = "match: " + url  + " ["
                                   + m[1] + "]["
                                   + m[2] + "]";
    }
    else {
      lis[i].innerHTML = "no match: " + url;
    }
  }
}
Greg Bacon
But \S+ is still greedy, so it does not actually solve the problem.
molf
Please be more specific: doesn't solve the problem in what way?
Greg Bacon
@gbacon: Your regex will not stop at &nbsp; because \S+ is greedy. It will continue to match non-whitespace characters until it finds whitespace.
molf
@molf: I'm sorry, but you're incorrect: in this case, backtracking forces the greedy quantifier to give up characters it would have matched. See the working example code.
Greg Bacon
@gbacon: I tried your code. Your regex matches "<url1> <url2><space>" only ONCE, rather than twice for each URL. I don't think this is what the OP is looking for. As I said before, it will greedily match all non-whitespace chars until the next whitespace char is found.
molf