I'm using this regex (((ht|f)tp(s?))\://)?(www.|[a-zA-Z].)[a-zA-Z0-9\-\.]+\.(com|edu|gov|mil|net|org|biz|info|name|museum|us|ca|uk)(\:[0-9]+)*(/($|[a-zA-Z0-9\.\,\;\?\'\\\+&%\$#\=~_\-]+))*
to search for URLs. The only problem is that it treats "you ca" as a URL. How do I change it so there HAS to be a period before the ending (in this case the 'ca'), so that 'you ca' no longer matches but 'you.ca' does?
Parsing URIs with regexes is a hard problem.
Either use a library like Regexp::Common::URI or prepare to spend a lot of time digging through a bunch of RFCs. Parsing URIs is far from trivial, and there are lots of subtle mistakes to be made.
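As a rough illustration of the "use a library" route, here is a minimal sketch in Python that leans on the standard library's `urllib.parse.urlsplit` instead of a hand-written pattern. The helper names `looks_like_url` and `find_urls`, the allowed scheme list, and the whitespace tokenising are my own assumptions, not part of any library. Note that it only accepts tokens with an explicit scheme, so it is stricter than the regex in the question.

```python
from urllib.parse import urlsplit

# Assumption: only these schemes count as URLs for our purposes.
ALLOWED_SCHEMES = {"http", "https", "ftp"}

def looks_like_url(token):
    """Return True if token has an allowed scheme and a dotted host name."""
    try:
        parts = urlsplit(token)
    except ValueError:
        # urlsplit rejects some malformed inputs, e.g. a broken IPv6 literal.
        return False
    host = parts.hostname or ""
    # Require a period inside the host, so "ca" alone never qualifies.
    return parts.scheme in ALLOWED_SCHEMES and "." in host.strip(".")

def find_urls(text):
    """Split on whitespace, trim trailing punctuation, keep URL-like tokens."""
    tokens = (word.rstrip(".,;!?") for word in text.split())
    return [token for token in tokens if looks_like_url(token)]
```

Something like `find_urls("see http://you.ca/page, not you ca")` should then pick out only the first item, because "you" and "ca" are separate tokens without a scheme or a host.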
You can use a quantifier on the period character, so '\.{1}' would require exactly one literal period before whatever follows.
The quantifier isn't strictly necessary to fix this problem, since '\.' on its own already matches exactly one period, but it may help to know about it. It's more explicit, and '{1}' is visually bigger than a dot, so it also serves as a separator in long, ugly regexes where, during debugging, you might accidentally put a "+" or "*" next to the dot.
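To see why the escaping matters, here is a small Python sketch with deliberately simplified patterns (they are hypothetical stand-ins, not the full regex from the question). As posted, the question's pattern already has '\.' before the TLD list, so my guess, and it is only a guess, is that the backslash is getting lost somewhere before the regex engine sees it (for example in a string literal), leaving a bare '.' that matches any character, including a space.

```python
import re

# Same TLD list as in the question.
TLDS = r"(?:com|edu|gov|mil|net|org|biz|info|name|museum|us|ca|uk)"

# Simplified host part; only the dot before the TLD differs between patterns.
HOST = r"[a-zA-Z][a-zA-Z0-9\-.]*"

unescaped = re.compile(HOST + r"." + TLDS + r"\b")     # '.' matches any character
escaped = re.compile(HOST + r"\." + TLDS + r"\b")      # '\.' matches a literal period
explicit = re.compile(HOST + r"\.{1}" + TLDS + r"\b")  # same as escaped, more visible

def finds_url(pattern, text):
    """Return True if the pattern matches anywhere in text."""
    return pattern.search(text) is not None

for name, pattern in [("unescaped", unescaped), ("escaped", escaped), ("explicit", explicit)]:
    print(name, finds_url(pattern, "you ca"), finds_url(pattern, "you.ca"))
```

With the bare dot, "you ca" is accepted because the space satisfies '.'; with '\.' or '\.{1}', only "you.ca" is.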
I use a freeware tool to check my regexes: http://www.weitz.de/regex-coach/
Perhaps it will be helpful to you.
John Gruber's regexp is the best I've found so far for finding URLs. See the article on his blog: An Improved Liberal, Accurate Regex Pattern for Matching URLs. It's in use in lots of production code. There are two versions: one matches any URL, while the other matches only http/https URLs.