ansaurus

Question

Regular Expression matching either a single special character OR a particular sequence

Answer 1

+6 A:

The \S+ bit is greedy, and will match as many non-space characters as possible, including any &nbsp; that might be there. Change it to its ungreedy equivalent \S+?, and you'll probably have better luck:

/(https?:\/\/\S+?)(\s|&nbsp;|$)/g;

(Updated because I overlooked the trailing ?.)
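
To make the difference concrete, here is a minimal sketch comparing the greedy and ungreedy forms. The sample text is an assumption, and the third pattern (ungreedy with an optional terminator) is only a guess at the intermediate version discussed in the comments below.

```javascript
// Assumed sample text: a URL followed by a literal "&nbsp;" entity.
const sampleText = "Visit http://www.cnn.com&nbsp;today for news";

// Greedy: \S+ runs through "&nbsp;today" and stops only at real whitespace.
const greedyMatch = /(https?:\/\/\S+)(\s|&nbsp;|$)/.exec(sampleText);

// Ungreedy: \S+? stops as soon as one of the terminators can match.
const lazyMatch = /(https?:\/\/\S+?)(\s|&nbsp;|$)/.exec(sampleText);

// Ungreedy with an OPTIONAL terminator: the terminator may match nothing,
// so \S+? settles for a single character.
const lazyOptionalMatch = /(https?:\/\/\S+?)(\s|&nbsp;)?/.exec(sampleText);

console.log(greedyMatch[1]);       // http://www.cnn.com&nbsp;today
console.log(lazyMatch[1]);         // http://www.cnn.com
console.log(lazyOptionalMatch[1]); // http://w
```

The last line shows why the ungreedy quantifier needs a required terminator (or $) after it.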

molf 2009-06-20 22:58:16

No, that now matches only the first non-word character.

apphacker 2009-06-20 23:10:13

I mean non-whitespace character, so it matches http://w and not http://www.cnn.com, for example.

apphacker 2009-06-20 23:11:03

@apphacker, OK, that makes sense. I overlooked the trailing question mark. I presume you placed it there because you incidentally want to match these URLs at the end of the string? The easiest way out of that is to match either whitespace, or &nbsp;, or the end of the string.

molf 2009-06-20 23:14:05

ah sweet, thank you.

apphacker 2009-06-20 23:23:52

Answer 2

A:

The problem is the ? quantifier in (\s|(&nbsp;))?. Remember that * and ? always succeed because they're allowed to match nothing. In your case, \S+ was gobbling up the non-whitespace, and the optional alternative was matching vacuously.

Change your pattern to /(https?:\/\/\S+)(\s|&nbsp;)/.

As for getting &nbsp; to appear in your question, you can write "&amp;nbsp;" in plain text, or an ampersand followed by "nbsp;" in backticks.

Yes, \S+ is greedy, but the regular expression engine will backtrack to find a match, which forces the earlier subexpression to give back part of what it matched.

The code below shows that, with the pattern above, \S+ does not swallow the non-breaking space entity. It produces the following results, where the strings inside the brackets are what the parenthesized subpatterns matched:

no match: http://www.stackoverflow.com/
match: http://www.stackoverflow.com/ [http://www.stackoverflow.com/][ ]
match: http://www.stackoverflow.com/ [http://www.stackoverflow.com/][ ]
no match: https://www.stackoverflow.com/
match: https://www.stackoverflow.com/ [https://www.stackoverflow.com/][ ]
match: https://www.stackoverflow.com/ [https://www.stackoverflow.com/][ ]

Code (but remove the space between the ampersand and "nbsp;"):

function testMatches() {
  var lis = document.getElementsByTagName("li");
  for (var i = 0; i < lis.length; i++) {  // declare i so it is not an implicit global
    var url = lis[i].innerHTML;
    var pattern = /(https?:\/\/\S+)(\s|& nbsp;)/;  // FIXME!
    var m = pattern.exec(url);
    if (m) {
      lis[i].innerHTML = "match: " + url  + " ["
                                   + m[1] + "]["
                                   + m[2] + "]";
    }
    else {
      lis[i].innerHTML = "no match: " + url;
    }
  }
}
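
The same behavior can be checked without a page of list items. The sketch below is not part of the original answer: the `captures` helper is hypothetical, and the full "optional" pattern is an assumption reconstructed from the fragment quoted at the top of this answer.

```javascript
// Hypothetical helper: returns the two captured groups, or null on no match.
function captures(pattern, input) {
  const m = pattern.exec(input);
  return m ? [m[1], m[2]] : null;
}

// Terminator required: backtracking makes the greedy \S+ give back "&nbsp;".
const requiredPattern = /(https?:\/\/\S+)(\s|&nbsp;)/;
// Terminator optional (assumed shape of the question's pattern):
// \S+ keeps everything and the optional group matches nothing.
const optionalPattern = /(https?:\/\/\S+)(\s|(&nbsp;))?/;

const entityInput = "http://www.stackoverflow.com/&nbsp;";

const withRequired = captures(requiredPattern, entityInput);
const withOptional = captures(optionalPattern, entityInput);
const bareUrl = captures(requiredPattern, "http://www.stackoverflow.com/");

console.log(withRequired); // group 1 is the URL, group 2 is "&nbsp;"
console.log(withOptional); // group 1 includes "&nbsp;", group 2 is undefined
console.log(bareUrl);      // null: nothing follows the URL, so no match
```

The null result for the bare URL corresponds to the "no match" lines in the output above.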

Greg Bacon 2009-06-20 23:56:40

But \S+ is still greedy, so it does not actually solve the problem.

molf 2009-06-21 00:07:32

Please be more specific: doesn't solve the problem in what way?

Greg Bacon 2009-06-21 01:05:06

@gbacon: Your regex will not stop at &nbsp; because \S+ is greedy. It will continue to match non-whitespace characters until it finds whitespace.

molf 2009-06-21 09:05:47

@molf: I'm sorry, but you're incorrect: in this case, backtracking forces the greedy quantifier to give up characters it would have matched. See the working example code.

Greg Bacon 2009-06-21 11:28:11

@gbacon: I tried your code. Your regex matches "<url1>&nbsp;<url2><space>" only ONCE, rather than twice (once for each URL). I don't think this is what the OP is looking for. As I said before, it will greedily match all non-whitespace chars until the next whitespace char is found.

molf 2009-06-21 11:56:27
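
The disagreement in this thread comes down to inputs with more than one URL. Here is a rough sketch of that case: the `allUrls` helper and the example.com-style hosts are assumptions for illustration, and the two patterns are the ones proposed in the two answers, with the g flag added to the second.

```javascript
// Hypothetical helper: collect group 1 of every match of a global pattern.
function allUrls(pattern, input) {
  const urls = [];
  let m;
  while ((m = pattern.exec(input)) !== null) {
    urls.push(m[1]);
  }
  return urls;
}

// Assumed input of the shape "<url1>&nbsp;<url2><space>".
const twoUrls = "http://a.example/&nbsp;http://b.example/ ";

// Greedy \S+ : one match that spans both URLs and the entity between them.
const greedyUrls = allUrls(/(https?:\/\/\S+)(\s|&nbsp;)/g, twoUrls);

// Ungreedy \S+? : stops at the first terminator, so each URL matches separately.
const lazyUrls = allUrls(/(https?:\/\/\S+?)(\s|&nbsp;|$)/g, twoUrls);

console.log(greedyUrls.length); // 1
console.log(lazyUrls.length);   // 2
```

Both observations in the thread hold: backtracking lets the greedy pattern split a single URL from a trailing &nbsp; at the end of the input, but only the ungreedy pattern separates two URLs joined by &nbsp;.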
