views:

52

answers:

5
+1  Q: 

URL regex issues

I'm using this regex to search for URLs: (((ht|f)tp(s?))\://)?(www.|[a-zA-Z].)[a-zA-Z0-9\-\.]+\.(com|edu|gov|mil|net|org|biz|info|name|museum|us|ca|uk)(\:[0-9]+)*(/($|[a-zA-Z0-9\.\,\;\?\'\\\+&%\$#\=~_\-]+))* The only problem is that it treats "you ca" as a URL. How do I change it so that there HAS to be a period before the ending (in this case the 'ca'), so that 'you ca' no longer matches but 'you.ca' does?

+1  A: 

You forgot to escape the periods in the (www.|[a-zA-Z].) block.

zigdon
How does that have anything to do with the `\.` block before the `(com|edu...` block?
JGB146
I don't know much about regexes. How would I escape them?
Patrick Gates
Add a \ before the periods in that block.
zigdon
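
To illustrate the escaping point above, here is a minimal sketch in Python. Python and the sample strings are assumptions for illustration only; the thread does not say which language or regex engine the asker is using. It compares the block as written in the question with the same block after a `\` is added before each period.

```python
import re

# The block from the question, as written: each "." matches ANY character.
unescaped = re.compile(r"(www.|[a-zA-Z].)")
# The same block with the periods escaped: "\." matches only a literal period.
escaped = re.compile(r"(www\.|[a-zA-Z]\.)")

def matches(pattern, text):
    """Return True if the whole of `text` matches `pattern`."""
    return pattern.fullmatch(text) is not None

samples = ["yo", "y.", "wwwx", "www."]  # hypothetical test strings
# Maps each sample to (matches unescaped block, matches escaped block).
results = {s: (matches(unescaped, s), matches(escaped, s)) for s in samples}
```

In this sketch the unescaped block accepts all four samples, while the escaped block accepts only the two that contain a literal period. Note that escaping also narrows the second alternative: `[a-zA-Z]\.` requires a single letter immediately followed by a period.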
+3  A: 

Parsing URIs with regexes is a hard problem.

Either use a library like Regexp::Common::URI or prepare to spend a lot of time studying a bunch of RFCs. Parsing URIs is far from trivial, and there are many subtle mistakes to be made.

szbalint
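
As a rough illustration of the library route, here is a sketch that swaps in Python's standard `urllib.parse.urlsplit` instead of the Perl module named above. The helper name `looks_like_web_url` and the set of accepted schemes are assumptions made for this example (the schemes are taken from the question's regex). It checks a single candidate string; it does not search free text for URLs.

```python
from urllib.parse import urlsplit

# Assumed from the schemes that appear in the question's regex.
ALLOWED_SCHEMES = {"http", "https", "ftp"}

def looks_like_web_url(candidate: str) -> bool:
    """Rough check: known scheme plus a host name containing a literal period."""
    try:
        parts = urlsplit(candidate)
    except ValueError:
        # urlsplit rejects some malformed inputs, e.g. unbalanced IPv6 brackets.
        return False
    host = parts.hostname or ""
    return parts.scheme in ALLOWED_SCHEMES and "." in host
```

Under these assumptions, `looks_like_web_url("http://you.ca/index.html")` is true and `looks_like_web_url("you ca")` is false. A bare `you.ca` is also rejected, because without a scheme `urlsplit` treats the whole string as a path; whether that is acceptable depends on what should count as a URL.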
A: 

You can use a quantifier for the period character, so '\.{1}' would require exactly one period before whatever follows.

It's not something that's a necessary part of the debugging of this problem, but it may help to know about it. It's just more explicit, and '{1}' is bigger than a dot, so it also serves as a separator in long, ugly regexes where, during debugging, you might accidentally throw a "+" or "*" next to the dot.

jonesy
How is that different from '\.'?
zigdon
+1 for discovering not one, but two uses for `{1}`. :D I still can't see myself ever using it, though; the clutter it adds to the regex cancels out whatever benefit it brings, in my opinion.
Alan Moore
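
A small Python sketch of the point raised in the comments: `\.{1}` and `\.` accept the same strings, so the quantifier is purely a readability choice. The patterns and sample strings here are made up for illustration.

```python
import re

plain = re.compile(r"you\.ca")        # a single escaped period
explicit = re.compile(r"you\.{1}ca")  # the same, with an explicit {1} quantifier

samples = ["you.ca", "you ca", "you..ca", "youca"]  # hypothetical inputs
plain_hits = [s for s in samples if plain.fullmatch(s)]
explicit_hits = [s for s in samples if explicit.fullmatch(s)]
```

Both lists contain only "you.ca": exactly one literal period is required either way.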
A: 

I use a freeware tool to check my regexes: http://www.weitz.de/regex-coach/

Perhaps it will be helpful to you.

Norbert de Langen
A: 

In my experience, John Gruber's regex is the best so far at finding URLs. See the article on his blog: "An Improved Liberal, Accurate Regex Pattern for Matching URLs". It's in use in lots of production code. There are two versions: one matches any URL, while the other matches only http/https URLs.

slebetman