url regex issues

tags:

regex
url

+1 Q:

I'm using this regex `(((ht|f)tp(s?))\://)?(www.|[a-zA-Z].)[a-zA-Z0-9\-\.]+\.(com|edu|gov|mil|net|org|biz|info|name|museum|us|ca|uk)(\:[0-9]+)*(/($|[a-zA-Z0-9\.\,\;\?\'\\\+&%\$#\=~_\-]+))*` to search for URLs. The only problem is that it treats "you ca" as a URL. How do I change it so there HAS to be a period before the ending (in this case the 'ca'), so that 'you ca' no longer matches but 'you.ca' does?

+1 A:

You forgot to escape the periods in the `(www.|[a-zA-Z].)` block.

zigdon 2010-08-10 23:01:42

How does that have anything to do with the `\.` block before the `(com|edu...` block?

JGB146 2010-08-10 23:08:40

I don't know much about regexes. How would I escape them?

Patrick Gates 2010-08-10 23:20:26

Add a \ before the periods in that block.

zigdon 2010-08-13 00:04:35
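
To illustrate the escaping point: an unescaped `.` matches any character, while `\.` matches only a literal period. Here is a minimal sketch using Python's `re` module (an assumption, since the thread never says which language is in use). It keeps only the host part of the posted pattern and compares it with a hypothetical variant in which the backslash before the TLD list has been lost, as can happen when a pattern passes through a string literal that consumes backslashes. Whether that is what actually happened to the asker is a guess, not something the thread confirms.

```python
import re

TLDS = r"(com|edu|gov|mil|net|org|biz|info|name|museum|us|ca|uk)"

# Host part of the pattern as posted: the dot before the TLD list is escaped.
AS_POSTED = re.compile(r"(www.|[a-zA-Z].)[a-zA-Z0-9\-\.]+\." + TLDS)

# Hypothetical variant: the backslash before the TLD dot has been dropped,
# so that position now accepts any character, including a space.
LOST_BACKSLASH = re.compile(r"(www.|[a-zA-Z].)[a-zA-Z0-9\-\.]+." + TLDS)


def matches(pattern, text):
    """Return True if the pattern is found anywhere in the text."""
    return pattern.search(text) is not None


# An unescaped dot accepts any character; an escaped dot does not.
print(re.fullmatch(r"www.", "wwwx") is not None)   # unescaped: accepts "wwwx"
print(re.fullmatch(r"www\.", "wwwx") is not None)  # escaped: rejects "wwwx"

print(matches(AS_POSTED, "you.ca"))       # literal period present
print(matches(AS_POSTED, "you ca"))       # no period, so no match
print(matches(LOST_BACKSLASH, "you ca"))  # the space satisfies the bare dot
```

If the pattern is stored in a quoted string, it may be worth printing it just before it is compiled to check that every backslash survived.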

+3 A:

Parsing URIs with regexes is a hard problem.

Either use a library like Regexp::Common::URI or prepare to spend lots of time investigating a bunch of RFCs. Parsing URIs is far from trivial, and there are lots of subtle mistakes to be made.

szbalint 2010-08-10 23:02:02
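
As a rough sketch of the library route: Regexp::Common::URI is a Perl module, so the example below swaps in Python's standard `urllib.parse` instead. The function names `looks_like_url` and `find_urls` are made up for illustration. Unlike the regex in the question, this version only accepts tokens that carry an explicit scheme, so a bare 'you.ca' is not reported.

```python
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https", "ftp"}


def looks_like_url(token):
    """Rough check: known scheme plus a host name that contains a period."""
    try:
        parts = urlsplit(token)
        host = parts.hostname or ""
    except ValueError:
        # urlsplit rejects some malformed inputs, e.g. broken IPv6 brackets.
        return False
    return parts.scheme in ALLOWED_SCHEMES and "." in host


def find_urls(text):
    """Split on whitespace, strip trailing punctuation, keep URL-like tokens."""
    candidates = (word.rstrip(".,;:!?)") for word in text.split())
    return [c for c in candidates if looks_like_url(c)]


print(find_urls("see http://you.ca/path, not you ca"))
```

Splitting on whitespace is a simplification; text with URLs wrapped in brackets or quotes would need more careful tokenizing.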

You can use a quantifier for the period character, so '\.{1}' would require exactly one period before whatever follows.

It's not something that's a necessary part of the debugging of this problem, but it may help to know about it. It's just more explicit, and '{1}' is bigger than a dot, so it also serves as a separator in long, ugly regexes where, during debugging, you might accidentally throw a "+" or "*" next to the dot.

jonesy 2010-08-10 23:03:22

How is that different from '\.'?

zigdon 2010-08-13 00:03:53
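
On whether `\.{1}` differs from `\.`: as far as matching goes, it does not, since `{1}` just means "exactly once", which is already the default. A small check using Python's `re` module (an assumption; the thread does not name a language) and a few made-up sample strings:

```python
import re

PLAIN = re.compile(r"you\.ca")
QUANTIFIED = re.compile(r"you\.{1}ca")

SAMPLES = ["you.ca", "you ca", "you..ca", "youca"]


def same_result(text):
    """True if both patterns agree on whether the whole text matches."""
    return bool(PLAIN.fullmatch(text)) == bool(QUANTIFIED.fullmatch(text))


for sample in SAMPLES:
    print(sample, bool(PLAIN.fullmatch(sample)), same_result(sample))
```

Both patterns accept only 'you.ca' from this list, so any difference is purely about readability.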

+1 for discovering not one, but two uses for `{1}`. :D I still can't see myself ever using it, though; the clutter it adds to the regex cancels out whatever benefit it brings, in my opinion.

Alan Moore 2010-08-16 13:08:33

I use a freeware tool to check my regexes: http://www.weitz.de/regex-coach/

Perhaps it will be helpful to you.

Norbert de Langen 2010-08-11 00:31:01

John Gruber's regexp is the best I have found so far for matching URLs. See the article on his blog, "An Improved Liberal, Accurate Regex Pattern for Matching URLs". It's in use in lots of production code. There are two versions: one matches any URL, while the other matches only http/https URLs.

slebetman 2010-08-11 01:01:49
