I've taken the Liberal URL Regex from Daring Fireball, merged it with some of Alan Storm's improvements, and hacked in fixes for a few bugs, such as support for IDN characters inside parentheses. This is what I have:

/(?:[\w-]+:\/\/?|www[.])[^\s()<>]+(?:(?:\([^\s()<>]*\)[^\s()<>]*)+|[^[:punct:]\s]|\/)/

However, I've run into a bug that I haven't been able to solve:

'www.dsd(sd)sdsd.com' // can also be the valid 'www.dsd.com/whatever(whatever)'

The above URL is being recognized as www.dsd(sd)sdsd.com' (or www.dsd.com/whatever(whatever)'), with the trailing quote included, instead of www.dsd(sd)sdsd.com (or www.dsd.com/whatever(whatever)). This only seems to happen when the URL contains parentheses, since the following URL:

'www.sampleurl.com'

is correctly recognized as www.sampleurl.com.

I think the [^[:punct:]\s]|\/ part of the regex is not being executed when the URL has parentheses. I've been trying for some time, but I can't find a solution. Can anyone help me?

For convenience, I've set up a Rubular permalink with the regex and some test data (the last URL fails).


I think that Gruber's regex was a little rushed; for instance, it doesn't match URLs like:

http://en.wikipedia.org/wiki/Something_(Special)_For_You

I'm even more surprised to see that both Gruber and Alan missed this really simple typo:

\([\w\d]+\)

Wouldn't \(\w+\) be enough? :S
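A quick way to check that claim is to compare the two character classes directly. The snippet below is only a sketch in Ruby (the engine behind Rubular); the variable names are made up for illustration. It tests every ASCII character against both classes and then tries the shorter pattern on a parenthesised word.

```ruby
# Every ASCII character, as one-character strings.
ascii = (0..127).map(&:chr)

# \d is already a subset of \w, so [\w\d] and \w should agree on every character.
same_class = ascii.all? { |ch| ch.match?(/[\w\d]/) == ch.match?(/\w/) }

# The shorter pattern still matches a parenthesised word.
simple = "(Special)"[/\(\w+\)/]

puts same_class   # true
puts simple       # (Special)
```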

+1  A: 

www.dsd(sd)sdsd.com is not a valid domain name.

If you had 'www.dsd.com/whatever(whatever)', it would be recognized correctly (or at least it is in my tests).

Joel L
That also doesn't seem to work (http://www.rubular.com/regexes/12851).
Alix Axel
Hm, true. I tested using the original Daring Fireball expression (which I use myself). I'm not a regex expert, so pending any other solution, I would remove Alan Storm's improvements (because I believe they are unnecessary).
Joel L
The Daring Fireball expression only matches word characters (\w, that is a-z, A-Z, 0-9 and _) inside parentheses.
Alix Axel
www.url.com/something(xyz) ... the parens part is how cookieless sessions correlate the key in ASP.NET.
Nissan Fan
+1  A: 
 /(?:[\w-]+:\/\/?|www[.])[^\s()<>]+(?:(?:\([^\s()<>]*\)[^\s()<>]*)+|[^[:punct:]\s]|\/)/
  www.                   |               |            |
                          dsd            |            |
                                          (sd)        |
                                                       sdsd.com'

That's how I think this breaks down. The part of the regex lined up above (sd) starts with an escaped open paren, then a starred character class matching sd, then an escaped closing paren; the next thing is [^\s()<>]*, which matches sdsd.com' (trailing quote included).
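The breakdown can be reproduced with a few lines of Ruby. This is a sketch, not a fix: liberal_url is the regex from the question copied verbatim, while paren_tail is a name made up here for the first alternative of its final group, pulled out so it can be run on its own.

```ruby
# The regex from the question, unchanged.
liberal_url = /(?:[\w-]+:\/\/?|www[.])[^\s()<>]+(?:(?:\([^\s()<>]*\)[^\s()<>]*)+|[^[:punct:]\s]|\/)/

# Only the first alternative of the final group: one or more "(...)" runs,
# each followed by any run of characters other than whitespace, parens or angle brackets.
paren_tail = /(?:\([^\s()<>]*\)[^\s()<>]*)+/

plain  = "'www.sampleurl.com'"[liberal_url]    # no parens: the [^[:punct:]\s] alternative ends the match
parens = "'www.dsd(sd)sdsd.com'"[liberal_url]  # parens: the first alternative runs to the end
tail   = "(sd)sdsd.com'"[paren_tail]

puts plain    # www.sampleurl.com
puts parens   # www.dsd(sd)sdsd.com'
puts tail     # (sd)sdsd.com'
```

Because the trailing [^\s()<>]* after the closing paren accepts the quote, the engine never needs to fall back to the [^[:punct:]\s] alternative once a parenthesised group has matched.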

Michał Marczyk
+2  A: 

Seems like Gruber has revised his regular expression:

\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.])(?:[^\s()<>]+|\([^\s()<>]+\))+(?:\([^\s()<>]+\)|[^`!()\[\]{};:'".,<>?«»“”‘’\s]))

Works just fine now.
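For anyone who wants to check this locally, here is a small Ruby sketch that runs the revised expression against the URLs from the question. The variable names are made up for illustration, and the i flag is an assumption added here so that uppercase schemes would match as well; the pattern body is the one quoted above with nothing else changed.

```ruby
# Revised pattern as quoted above; the trailing /i flag is an assumption, not part of the quoted text.
revised_url = /\b((?:[a-z][\w-]+:(?:\/{1,3}|[a-z0-9%])|www\d{0,3}[.])(?:[^\s()<>]+|\([^\s()<>]+\))+(?:\([^\s()<>]+\)|[^`!()\[\]{};:'".,<>?«»“”‘’\s]))/i

# String#[] with a regex returns the whole match, or nil when nothing matches.
quoted_host = "'www.dsd(sd)sdsd.com'"[revised_url]
quoted_path = "'www.dsd.com/whatever(whatever)'"[revised_url]
wiki        = "See http://en.wikipedia.org/wiki/Something_(Special)_For_You."[revised_url]

puts quoted_host   # www.dsd(sd)sdsd.com
puts quoted_path   # www.dsd.com/whatever(whatever)
puts wiki          # http://en.wikipedia.org/wiki/Something_(Special)_For_You
```

The last group of the pattern requires the match to end either with a balanced (...) run or with a character that is not trailing punctuation, which is why the closing quote and the final period are left out.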

Alix Axel