I am trying to match URLs with a regular expression that has already been tested, but when I evaluate it in JavaScript it returns false.

Here is my code:

var $regex = new RegExp("<a\shref=\"(\#\d+|(https?|ftp):\/\/[-a-z0-9+&@#\/%?=~_|!:,.;\\(\\)]+)\"(\stitle=\"[^\"<>]+\")?\s?>|<\/a>");

var $test = new Array();
$test[0] = '<a href="http://www.nytimes.com/imagepages/2010/09/02/us/HURRICANE.html"&gt;';
$test[1] = '<a href="http://www.msnbc.msn.com/id/38877306/ns/weather/%29;"&gt;';
$test[2] = '<a href="http://www.msnbc.msn.com/id/38927104" title="dd" alt="dd">';
for(var i = 0; i < $test.length; i++)
{
    console.log($test[i]);
    console.log($regex.test($test[i]));
}

Anyone have any idea what is going on?

+2  A: 

You need to escape backslashes when creating regular expressions with new RegExp(), since you pass a string and the backslash is also the escape character in string literals.

new RegExp("\s"); // becomes /s/
new RegExp("\\s"); // becomes /\s/

Or just write your regexp as a literal.

var re = /\s/;
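
As a quick check of the point above, here is a minimal sketch that compares the three forms by looking at the pattern each RegExp object actually ends up with. The variable names are only illustrative.

```javascript
// The string literal is parsed first, so "\s" reaches RegExp as the single character "s".
var unescaped = new RegExp("\s");
// "\\s" is parsed to backslash + s, which RegExp reads as the whitespace class.
var escaped = new RegExp("\\s");
// A regex literal needs no extra escaping.
var literal = /\s/;

console.log(unescaped.source); // s
console.log(escaped.source);   // \s

console.log(unescaped.test(" ")); // false
console.log(escaped.test(" "));   // true
console.log(literal.test(" "));   // true
```

Printing .source like this is a cheap way to see what a pattern built from a string really contains.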

Also, if you want to match URLs, why take a whole HTML tag into account? The following regexp would suffice:

var urlReg = /^(?:https?|ftp):\/\/[\w.-]*\/[^\s]*/i;
// anything past the third / that's not a space, is valid.
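
A rough sketch of how a URL-only check along these lines might be used on bare URLs rather than whole tags. The pattern below is an assumption for illustration (it adds a trailing $ so that a space anywhere in the input is rejected), and the sample strings other than the nytimes URL are made up.

```javascript
// Assumed pattern: scheme, "://", host characters, a slash, then non-space characters to the end.
var urlPattern = /^(?:https?|ftp):\/\/[\w.-]+\/\S*$/i;

// Hypothetical inputs: two well-formed URLs, one without a scheme, one containing a space.
var candidates = [
    "http://www.nytimes.com/imagepages/2010/09/02/us/HURRICANE.html",
    "ftp://example.com/pub/file.txt",
    "www.example.com/no-scheme",
    "http://example.com/has space"
];

var results = [];
for (var i = 0; i < candidates.length; i++) {
    results.push(urlPattern.test(candidates[i]));
    console.log(candidates[i], results[i]);
}
```

The first two candidates pass and the last two fail, which is usually all a sanity check needs; this is not meant as a complete URL validator.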
BGerrissen
I can't believe I overlooked that. Thanks for your help, that was my problem. I've been staring at that expression for far too long trying to figure that out. Much appreciated!
Wade
A: 

There are multiple problems.

You need to escape backslashes. Any character with a special meaning needs to be escaped with a backslash in the regular expression, and the backslash itself needs to be escaped in the string. Effectively, \s should be written as \\s if you construct the expression with new RegExp("\\s").

You need to allow more characters in your URLs. Currently the character class only contains lowercase letters and there is no i flag, so the uppercase HURRICANE in the first URL cannot match. I would propose a character class like [^"] to match everything after http://. (Escaping the " character when used in a string will make it [^\"].)

You're not taking alt attributes into account: you only match title attributes.

A working example:

// Ditch new RegExp("...") in favour of /.../ because it is simpler.
var $regex = /<a\shref="(#\d+|(https?|ftp):\/\/[^"]+)"(\stitle="[^"]+")?(\salt="[^"]+")?|<\/a>/;

var $test = new Array();
$test[0] = '<a href="http://www.nytimes.com/imagepages/2010/09/02/us/HURRICANE.html"&gt;';
$test[1] = '<a href="http://www.msnbc.msn.com/id/38877306/ns/weather/%29;"&gt;';
$test[2] = '<a href="http://www.msnbc.msn.com/id/38927104" title="dd" alt="dd">';
for(var i = 0; i < $test.length; i++)
{
    console.log($test[i]);
    console.log($regex.test($test[i]));
}

All three examples match this regex.
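
If the goal is to get the URL out of the tag rather than just a true/false answer, the capture groups of the same pattern can be read with exec(). A small sketch, reusing the regex above under the illustrative name tagPattern:

```javascript
// Same pattern as above; group 1 is the href value, group 2 the scheme.
var tagPattern = /<a\shref="(#\d+|(https?|ftp):\/\/[^"]+)"(\stitle="[^"]+")?(\salt="[^"]+")?|<\/a>/;

var sample = '<a href="http://www.msnbc.msn.com/id/38927104" title="dd" alt="dd">';
var match = tagPattern.exec(sample);

console.log(match[1]); // http://www.msnbc.msn.com/id/38927104
console.log(match[2]); // http

// A closing tag matches the second alternative, so the groups are undefined.
var closing = tagPattern.exec("</a>");
console.log(closing[1]); // undefined
```

Because the pattern also matches a bare closing tag, check that match[1] is defined before using it as a URL.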

molf